Abstract

High dimensional statistics deals with the challenge of extracting structured information
from complex model settings. Compared with the growing number of frequentist
methodologies, there are rather few theoretically optimal Bayes methods that can deal
with very general high dimensional models. In contrast, Bayes methods have been extensively
studied in various nonparametric settings and rate optimal posterior contraction
results have been established. This paper provides a unified approach to both Bayes high 
dimensional statistics and Bayes nonparametrics in a general framework of structured 
linear models. With the proposed two-step model selection prior, we prove a general 
theorem of posterior contraction under an abstract setting. The main theorem can be 
used to derive new results on optimal posterior contraction under many complex model 
settings including stochastic block model, graphon estimation and dictionary learning. It 
can also be used to re-derive optimal posterior contraction for problems such as sparse 
linear regression and nonparametric aggregation, which improve upon previous Bayes 
results for these problems. The key to this success lies in the proposed two-step prior
distribution. The prior on the parameters is an elliptical Laplace distribution that is
capable of modeling signals with large magnitude, and the prior on the models involves an
important correction factor that compensates for the effect of the normalizing constant of
the elliptical Laplace distribution.

Keywords. Oracle inequality, Stochastic block model, Graphon, Sparse linear regression,
Aggregation, Dictionary learning, Posterior contraction


1 Introduction 

The theory of posterior distributions has recently been investigated extensively in Bayes
nonparametrics. Important works such as [6, 5, 24, 25, 48, 53, 27, 13] established that the posterior
distribution contracts to a small neighborhood of the truth under proper conditions on likelihood
functions and priors. These works bridge the gap between frequentist and Bayesian
views of statistics from a fundamental perspective.

Despite the success of theoretical advancements in Bayes nonparametrics, there are not
many theories developed for Bayes high dimensional statistics. A few exceptions are [14] on 
sparse Gaussian sequence model, [4] on bandable precision matrix estimation and [22] on 
sparse PCA. Recently, [15] established posterior contraction rates for sparse linear regression
with a spike and slab prior under assumptions comparable to those of the Lasso estimator [49, 7].
The results of [15] include posterior contraction rates for prediction error and estimation
error, oracle inequalities and model selection consistency. However, sparse linear regression
is only one example of high dimensional statistics. There is a clear need for a Bayes theory
that covers more complicated model settings such as dictionary learning, the stochastic
block model and multi-task learning. It is not clear whether the method and the analysis
used in [15] can be extended to these more complex settings.

This paper provides a unified approach for both Bayes high dimensional and Bayes nonparametric
statistics in a general framework of structured linear models. We first establish
a unified view of various high-dimensional and nonparametric models, and then propose a 
single prior distribution for all models considered in our framework. We establish optimal 
rates of convergence of the posterior distributions under appropriate conditions. The results 
directly lead to minimax posterior contraction rates in stochastic block model, biclustering, 
sparse linear regression, regression with group sparsity, multi-task learning and dictionary 
learning. Moreover, we also derive a general posterior oracle inequality that allows arbitrary 
model misspecification. Applications of the posterior oracle inequality let us obtain posterior 
contraction rates even for models that are not included in our framework. Examples considered
in this paper are nonparametric graphon estimation, linear regression with approximate
sparsity, wavelet estimation under Besov space and various forms of nonparametric aggregation.

At the heart of our general theory is a proposed two-step prior distribution, which naturally
accommodates the structured linear model by first modeling the structure and then
modeling the parameters. This two-step modeling strategy was first investigated by [14]
for Gaussian sequence models. A key ingredient of the prior distribution is that the tail of
the distribution on the model parameter $Q$ cannot be too light [14, 15], which motivates
[14, 15] to use the independent Laplace prior with density proportional to $\exp(-\lambda\|Q\|_1)$ on
the parameter. Though this prior distribution leads to optimal posterior contraction rates in
the Gaussian sequence model [14], it requires some excessive assumptions on the design matrix
when it is applied to sparse linear regression [15]. The proposal in this paper is the elliptical
Laplace distribution with density proportional to $\exp(-\lambda\|\mathscr{X}_Z(Q)\|)$ for some linear operator
$\mathscr{X}_Z(\cdot)$. Note that we use the $\ell_2$ norm instead of the $\ell_1$ norm. With this choice, not only
are we able to weaken the assumptions in [15], but we can also solve a more general class
of problems in a unified way. To compensate for the influence of the normalizing constant of
an elliptical Laplace distribution, a correction factor on the prior mass is considered in the
model selection step.

The paper is organized as follows. Section 2 introduces the general framework of structured
linear models. A general prior distribution is proposed in Section 3. Section 4 presents
the main results of the paper including rate optimal posterior oracle inequality and posterior
contraction. The main results are applied to ten examples drawn from nonparametric and
high dimensional statistics in Section 5. In Section 6, we present further results on sparse
linear regression. All technical proofs are gathered in Sections 7-10.

We close this section by introducing some notation. Given an integer $d$, we use $[d]$
to denote the set $\{1,2,\ldots,d\}$. For a set $S$, $|S|$ denotes its cardinality and $\mathbb{I}_S$ denotes the
indicator function. For a vector $u=(u_i)$, $\|u\|=\sqrt{\sum_i u_i^2}$ denotes the $\ell_2$ norm. For a matrix
$A=(A_{ij})\in\mathbb{R}^{n\times p}$ and a subset $T\subset[n]\times[p]$, $A_T$ denotes the array $\{A_t\}_{t\in T}$. For any
$I\subset[n]$ and $J\subset[p]$, we let $A_{I*}=A_{I\times[p]}$ and $A_{*J}=A_{[n]\times J}$. The Frobenius norm, $\ell_1$ norm
and $\ell_\infty$ norm are defined by $\|A\|_F=\sqrt{\sum_{ij}A_{ij}^2}$, $\|A\|_1=\sum_{ij}|A_{ij}|$ and $\|A\|_\infty=\max_{ij}|A_{ij}|$,
respectively. When $A=A^T\in\mathbb{R}^{p\times p}$ is symmetric, the operator norm $\|A\|_{\rm op}$ is defined by its
largest singular value and the matrix $\ell_1$ norm $\|A\|_{\ell_1}$ is defined by the maximum row sum.
The inner product is defined by $\langle u,v\rangle=\sum_i u_iv_i$ when applied to vectors and is defined
by $\langle A,B\rangle=\sum_{ij}A_{ij}B_{ij}$ when applied to matrices. Given two numbers $a,b\in\mathbb{R}$, we use
$a\vee b=\max(a,b)$ and $a\wedge b=\min(a,b)$. The floor function $\lfloor a\rfloor$ is the largest integer no
greater than $a$, and the ceiling function $\lceil a\rceil$ is the smallest integer no less than $a$. For two
positive sequences $\{a_n\}$, $\{b_n\}$, $a_n\lesssim b_n$ means $a_n\le Cb_n$ for some constant $C>0$ independent
of $n$, and $a_n\asymp b_n$ means $a_n\lesssim b_n$ and $b_n\lesssim a_n$. The symbols $\mathbb{P}$ and $\mathbb{E}$ denote generic
probability and expectation operators whose distribution is determined from the context.

2 Structured linear models 

Let us consider the following structured linear model

$$Y = \mathscr{X}_Z(Q) + W \in \mathbb{R}^N,$$

where $W\in\mathbb{R}^N$ is a noise vector and $\mathscr{X}_Z(\cdot)$ is a linear operator. The signal $\mathscr{X}_Z(Q)$ has
two elements, the parameter $Q$ and the structure/model $Z$ that indexes the linear operator
$\mathscr{X}_Z$. The structure $Z$ is in some discrete space $\mathcal{Z}_\tau$, which is further indexed by $\tau\in\mathcal{T}$
for some finite set $\mathcal{T}$. We introduce a function $\ell(\mathcal{Z}_\tau)$ that determines the dimension of the
parameter $Q$. In other words, $Q\in\mathbb{R}^{\ell(\mathcal{Z}_\tau)}$ and $\ell(\mathcal{Z}_\tau)$ is referred to as the effective dimension
of the structured linear model. The complexity of the model is defined by the quantity

$$\ell(\mathcal{Z}_\tau) + \log|\mathcal{Z}_\tau|, \tag{1}$$

the sum of the effective dimension and the logarithmic cardinality of the structure space.
As we are going to show later, (1) will be the posterior contraction rate that we target.
Moreover, in all the examples considered in the paper, (1) will be the minimax rate under the
prediction loss. The only requirement we impose on the model is the linearity of the operator
$\mathscr{X}_Z$. That is, given any $Z\in\mathcal{Z}_\tau$ with any $\tau\in\mathcal{T}$, we have

$$\mathscr{X}_Z(Q_1+Q_2) = \mathscr{X}_Z(Q_1) + \mathscr{X}_Z(Q_2), \quad \text{for all } Q_1,Q_2\in\mathbb{R}^{\ell(\mathcal{Z}_\tau)}. \tag{2}$$

Therefore, we can also view $\mathscr{X}_Z$ as a matrix in $\mathbb{R}^{N\times\ell(\mathcal{Z}_\tau)}$. From now on, whenever we apply a
matrix operation with $\mathscr{X}_Z$, the operator $\mathscr{X}_Z$ is understood to be a matrix with slight abuse
of notation.

The above framework of structured linear models includes many examples. In this paper, 
we consider the following six representative instances. 

1. Stochastic block model. Consider $\mathscr{X}_Z(Q)\in[0,1]^{n\times n}$ to be the mean matrix of a random
graph with specification $[\mathscr{X}_Z(Q)]_{ij}=Q_{z(i)z(j)}$. The object $z\in[k]^n$ is the vector of labels of the
graph nodes. Moreover, it is easy to see that the parameter $Q$ is of dimension $k^2$.
Therefore, the stochastic block model is a special case of our general framework in view of
the relation $Z=z$, $\tau=k$, $\mathcal{T}=[n]$, $\mathcal{Z}_k=[k]^n$ and $\ell(\mathcal{Z}_k)=k^2$.

2. Biclustering. For a matrix $\mathscr{X}_Z(Q)\in\mathbb{R}^{n\times m}$, a biclustering model means that both
rows and columns have clustering structures. That is, $[\mathscr{X}_Z(Q)]_{ij}=Q_{z_1(i)z_2(j)}$ for some
$z_1\in[k]^n$ and $z_2\in[l]^m$. The parameter $Q$ has dimension $kl$. Thus, the biclustering model
is a special case of our general framework by the relation $Z=(z_1,z_2)$, $\tau=(k,l)$,
$\mathcal{T}=[n]\times[m]$, $\mathcal{Z}_{k,l}=[k]^n\times[l]^m$ and $\ell(\mathcal{Z}_{k,l})=kl$.

3. Sparse linear regression. A $p$-dimensional sparse linear regression model refers to $X\beta$,
where $\beta\in\mathbb{R}^p$ has a subset of nonzero entries and it can be represented by $\beta^T=(\beta_S^T,0_{S^c}^T)$
for some subset $S\subset[p]$. In other words, $X\beta=X_{*S}\beta_S$. It can be represented in a general
way by letting $Z=S$, $\tau=s$, $\mathcal{T}=[p]$, $\mathcal{Z}_s=\{S\subset[p]:|S|=s\}$, $\ell(\mathcal{Z}_s)=s$ and $Q=\beta_S$.
Moreover, $\mathscr{X}_Z(Q)=X_{*S}\beta_S$.

4. Linear regression with group sparsity. It refers to the model $XB$ with $B\in\mathbb{R}^{p\times m}$ being
a coefficient matrix with nonzero rows in some subset $S\subset[p]$. It can be represented in
a general form similarly to sparse linear regression, except that $\ell(\mathcal{Z}_s)=ms$.

5. Multi-task learning. Similar to the last example, multi-task learning is a collection of
$m$ regression problems. That is, we consider $XB$ for some $B\in\mathbb{R}^{p\times m}$. The $j$th column
of $B$ is represented as $B_{*j}=Q_{*z(j)}$ for some $z\in[k]^m$ and $Q\in\mathbb{R}^{p\times k}$. Thus, it is a
special case of our general framework by letting $Z=z$, $\tau=k$, $\mathcal{T}=[m]$, $\mathcal{Z}_k=[k]^m$ and
$\ell(\mathcal{Z}_k)=pk$.

6. Dictionary learning. Consider the model $\mathscr{X}_Z(Q)=QZ\in\mathbb{R}^{n\times d}$ for some $Z\in\{-1,0,1\}^{p\times d}$
and $Q\in\mathbb{R}^{n\times p}$. Each column of $Z$ is assumed to be sparse. Therefore, dictionary
learning can be viewed as sparse regression without knowing the design. It can be
written in a general form by letting $\tau=(p,s)$, $\mathcal{T}=\{(p,s)\in[n\wedge d]\times[n]:s\le p\}$,
$\mathcal{Z}_{p,s}=\{Z\in\{-1,0,1\}^{p\times d}:\max_{j\in[d]}|{\rm supp}(Z_{*j})|\le s\}$ and $\ell(\mathcal{Z}_{p,s})=np$.
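To make the matrix view of the operator concrete, the following minimal sketch builds the matrix form of $\mathscr{X}_Z$ for two of the examples above and checks the linearity property (2) numerically. The function names, the 0-based labels and the toy inputs are illustrative assumptions, not part of the paper.

```python
import itertools

def sparse_regression_operator(X, S):
    """Matrix form of Q -> X[:, S] Q: keep only the columns of X indexed by S."""
    return [[row[j] for j in S] for row in X]

def sbm_operator(z, k):
    """Matrix form of Q -> (Q[z(i)][z(j)])_{i,j}, with Q flattened row-major.

    Rows are indexed by node pairs (i, j); labels in z are 0-based here.
    """
    n = len(z)
    rows = []
    for i, j in itertools.product(range(n), repeat=2):
        row = [0.0] * (k * k)
        row[z[i] * k + z[j]] = 1.0
        rows.append(row)
    return rows

def apply_operator(op, q):
    """Apply the matrix form of an operator to a parameter vector q."""
    return [sum(a * b for a, b in zip(row, q)) for row in op]

# Sparse regression: N = 3 observations, p = 4 covariates, support S = {0, 2}.
X = [[1.0, 2.0, 0.0, 1.0],
     [0.0, 1.0, 3.0, 1.0],
     [2.0, 0.0, 1.0, 1.0]]
op_reg = sparse_regression_operator(X, [0, 2])

# Stochastic block model: n = 4 nodes, k = 2 communities, so Q has k^2 = 4 entries.
op_sbm = sbm_operator([0, 1, 1, 0], k=2)

# Linearity check (2) on two arbitrary parameter vectors.
q1, q2 = [0.1, 0.2, 0.3, 0.4], [0.5, 0.0, 0.1, 0.2]
lhs = apply_operator(op_sbm, [a + b for a, b in zip(q1, q2)])
rhs = [a + b for a, b in zip(apply_operator(op_sbm, q1), apply_operator(op_sbm, q2))]
```

In both cases the number of columns of the matrix equals the effective dimension $\ell(\mathcal{Z}_\tau)$, namely $s$ for sparse regression and $k^2$ for the stochastic block model.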

3 The prior distribution 

In this section, we introduce a prior distribution on the structured linear model. The prior
distribution has a two-step sampling procedure. First, we sample a structure
$Z$. Second, given $Z$, we sample the parameter $Q$. Let us first present the prior distribution
on the parameter $Q\in\mathbb{R}^{\ell(\mathcal{Z}_\tau)}$. We propose the elliptical Laplace distribution with density
function proportional to $\exp(-\lambda\|\mathscr{X}_Z(Q)\|)$. By direct calculation of the normalizing constant,
the density function is

$$f_{\ell(\mathcal{Z}_\tau),\mathscr{X}_Z,\lambda}(Q) = \frac{\sqrt{\det(\mathscr{X}_Z^T\mathscr{X}_Z)}}{2}\left(\frac{\lambda}{\sqrt{\pi}}\right)^{\ell(\mathcal{Z}_\tau)}\frac{\Gamma(\ell(\mathcal{Z}_\tau)/2)}{\Gamma(\ell(\mathcal{Z}_\tau))}\exp\left(-\lambda\|\mathscr{X}_Z(Q)\|\right). \tag{3}$$

Recall that $\mathscr{X}_Z$ is understood as a matrix in $\mathbb{R}^{N\times\ell(\mathcal{Z}_\tau)}$ whenever a matrix operation is applied.
The elliptical Laplace distribution belongs to the elliptical family [20] with scatter matrix
proportional to $(\mathscr{X}_Z^T\mathscr{X}_Z)^{-1}$. Compared with an i.i.d. distribution on $Q$, the density function
(3) involves an extra factor $\Gamma(\ell(\mathcal{Z}_\tau)/2)/\Gamma(\ell(\mathcal{Z}_\tau))$ in the normalizing constant. This factor needs to be
corrected in the model selection step.
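As a sanity check on the normalizing constant of the elliptical Laplace density, the sketch below evaluates it on the log scale and compares it with two cases that can be worked out by hand: $\ell=1$, where the density reduces to a scaled one-dimensional Laplace density, and $\ell=2$ with an identity operator, where the constant is $\lambda^2/(2\pi)$. The function name and the numerical values of $\lambda$ and $x$ are illustrative assumptions.

```python
import math

def log_normalizing_constant(ell, logdet_gram, lam):
    """Log of the constant multiplying exp(-lam * ||X_Z(Q)||) in the density.

    ell:         effective dimension l(Z_tau)
    logdet_gram: log det(X_Z^T X_Z), assumed finite (positive determinant)
    lam:         scale parameter lambda > 0
    """
    return (0.5 * logdet_gram - math.log(2.0)
            + ell * (math.log(lam) - 0.5 * math.log(math.pi))
            + math.lgamma(ell / 2.0) - math.lgamma(ell))

# ell = 1 with X_Z = (x): the density is (lam * |x| / 2) * exp(-lam * |x * q|).
lam, x = 1.5, 3.0
c1 = math.exp(log_normalizing_constant(1, math.log(x * x), lam))

# ell = 2 with X_Z equal to the identity: the constant is lam^2 / (2 * pi).
c2 = math.exp(log_normalizing_constant(2, 0.0, lam))

# Crude midpoint-rule check that the ell = 1 density integrates to one on [-20, 20].
h = 1e-3
integral = sum(c1 * math.exp(-lam * abs(x * (-20.0 + (i + 0.5) * h))) * h
               for i in range(40000))
```

Working on the log scale with `math.lgamma` avoids overflow of the Gamma functions when the effective dimension is large.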

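Let $\epsilon(\mathcal{Z}_\tau)$ be a function satisfying

$$\epsilon(\mathcal{Z}_\tau) \ge \ell(\mathcal{Z}_\tau) + \log|\mathcal{Z}_\tau|, \tag{4}$$

and then the sampling procedure of the prior distribution $\Pi$ on $\mathscr{X}_Z(Q)$ is given by:

1. Sample $\tau\sim\pi$ from $\mathcal{T}$, where $\pi(\tau)\propto\frac{\Gamma(\ell(\mathcal{Z}_\tau))}{\Gamma(\ell(\mathcal{Z}_\tau)/2)}\exp\left(-D\epsilon(\mathcal{Z}_\tau)\right)$;

2. Conditioning on $\tau$, sample $Z$ uniformly from the set $\bar{\mathcal{Z}}_\tau=\{Z\in\mathcal{Z}_\tau:\det(\mathscr{X}_Z^T\mathscr{X}_Z)>0\}$;

3. Conditioning on $(\tau,Z)$, sample $Q\sim f_{\ell(\mathcal{Z}_\tau),\mathscr{X}_Z,\lambda}$.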

Step 1 weighs the structure index $\tau$ by the function $\epsilon(\mathcal{Z}_\tau)$ that satisfies (4). For all the
examples considered in the paper, $\epsilon(\mathcal{Z}_\tau)$ is chosen to be of the same order as the model
complexity (1). The quantity $\Gamma(\ell(\mathcal{Z}_\tau))/\Gamma(\ell(\mathcal{Z}_\tau)/2)$ is called the correction factor; it is imposed to
compensate for the influence of $\Gamma(\ell(\mathcal{Z}_\tau)/2)/\Gamma(\ell(\mathcal{Z}_\tau))$ in the elliptical Laplace distribution. Without the
correction factor, $\exp(-D\epsilon(\mathcal{Z}_\tau))$ is the complexity prior used by [14, 15] in the Gaussian sequence
model and sparse linear regression. Since the support $\mathcal{T}$ is a finite set, $\pi$ is always a valid
probability mass function. Step 2 samples a structure $Z$ uniformly in $\bar{\mathcal{Z}}_\tau$. It is sufficient to
consider such $Z$ that $\det(\mathscr{X}_Z^T\mathscr{X}_Z)>0$ for all the examples considered in this paper. Such a
restriction leads to a proper density function (3) and thus Step 3 is well defined.
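The model selection step can be sketched numerically as follows. The code computes the normalized prior mass over the structure index, with weight proportional to the correction factor $\Gamma(\ell)/\Gamma(\ell/2)$ times $\exp(-D\epsilon)$, on the log scale. The function name, and the choices $n=30$ and $D=10$ for a stochastic block model with $\ell=k^2$ and $\epsilon=k^2+n\log k$, are illustrative assumptions rather than recommendations from the paper.

```python
import math

def model_prior(ells, eps, D):
    """Normalized pi(tau), proportional to Gamma(l) / Gamma(l / 2) * exp(-D * eps).

    ells: effective dimensions l(Z_tau), one per index tau
    eps:  complexity values, one per index tau
    D:    prior constant D > 0
    Works on the log scale and normalizes with log-sum-exp for stability.
    """
    logw = [math.lgamma(l) - math.lgamma(l / 2.0) - D * e
            for l, e in zip(ells, eps)]
    m = max(logw)
    log_total = m + math.log(sum(math.exp(v - m) for v in logw))
    return [math.exp(v - log_total) for v in logw]

# Stochastic block model with n = 30 nodes: tau = k, l = k^2, eps = k^2 + n log k.
n, D = 30, 10.0
ks = list(range(1, n + 1))
ells = [k * k for k in ks]
eps = [k * k + n * math.log(k) for k in ks]
pi = model_prior(ells, eps, D)
```

Because the correction factor grows quickly with the effective dimension, the shape of the resulting mass function depends noticeably on how large $D$ is taken.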

After defining the prior, we also need to specify the likelihood function. The six examples
in Section 2 have different distributions. For example, the stochastic block model usually
assumes a Bernoulli random graph, while sparse linear regression often works with general
sub-Gaussian noise distributions. To pursue a unified approach, we propose to use the
Gaussian likelihood $Y|(Z,Q)\sim N(\mathscr{X}_Z(Q),I_N)$ throughout the paper. Then, the posterior
distribution is


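$$\Pi\left(\mathscr{X}_Z(Q)\in U\mid Y\right) = \frac{\sum_{\tau\in\mathcal{T}}e^{-D\epsilon(\mathcal{Z}_\tau)}\sum_{Z\in\bar{\mathcal{Z}}_\tau}\frac{\sqrt{\det(\mathscr{X}_Z^T\mathscr{X}_Z)}}{|\bar{\mathcal{Z}}_\tau|}\left(\frac{\lambda}{\sqrt{\pi}}\right)^{\ell(\mathcal{Z}_\tau)}\int_{\mathscr{X}_Z(Q)\in U}e^{-\frac{1}{2}\|Y-\mathscr{X}_Z(Q)\|^2-\lambda\|\mathscr{X}_Z(Q)\|}\,dQ}{\sum_{\tau\in\mathcal{T}}e^{-D\epsilon(\mathcal{Z}_\tau)}\sum_{Z\in\bar{\mathcal{Z}}_\tau}\frac{\sqrt{\det(\mathscr{X}_Z^T\mathscr{X}_Z)}}{|\bar{\mathcal{Z}}_\tau|}\left(\frac{\lambda}{\sqrt{\pi}}\right)^{\ell(\mathcal{Z}_\tau)}\int e^{-\frac{1}{2}\|Y-\mathscr{X}_Z(Q)\|^2-\lambda\|\mathscr{X}_Z(Q)\|}\,dQ}.$$

Note that in the above formula of the posterior distribution, the factor $\Gamma(\ell(\mathcal{Z}_\tau)/2)/\Gamma(\ell(\mathcal{Z}_\tau))$ in the Laplace
normalizing constant has been cancelled out by the correction factor $\Gamma(\ell(\mathcal{Z}_\tau))/\Gamma(\ell(\mathcal{Z}_\tau)/2)$ in the model
selection prior.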

4 Main results 

In this section, we analyze the posterior distribution for the general structured linear model.
Though the prior specifies a model $\mathscr{X}_Z(Q)$, we do not need to assume that the data is
generated from the same model. Instead, we allow the data to be generated by an arbitrary
signal with sub-Gaussian noise. That is,

$$Y = \theta^* + W,$$

where $W=Y-\theta^*$ is the noise vector with a sub-Gaussian tail satisfying

$$\mathbb{P}\left(|\langle W,K\rangle|>t\right)\le e^{-\rho t^2/2} \quad \text{for all } \|K\|=1. \tag{5}$$

The sub-Gaussianity number $\rho>0$ is assumed to be a constant throughout the paper. We
also impose a mild condition on the function $\epsilon(\mathcal{Z}_\tau)$. That is,

$$\left|\{\tau\in\mathcal{T}:t-1<\epsilon(\mathcal{Z}_\tau)\le t\}\right|\le t \quad \text{for all } t\in\mathbb{N}. \tag{6}$$


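Recall that $\lambda$ and $D$ are parameters of the prior distribution $\Pi$. The main result of the
paper is stated in the following theorem.

Theorem 4.1. Assume (4), (5) and (6). Given any $\theta^*\in\mathbb{R}^N$, any $\tau^*\in\mathcal{T}$, any $Z^*\in\bar{\mathcal{Z}}_{\tau^*}$,
any $Q^*\in\mathbb{R}^{\ell(\mathcal{Z}_{\tau^*})}$, any constants $\lambda,\rho>0$ and any sufficiently small constant $\delta\in(0,1)$, there
exists some constant $D_{\lambda,\delta,\rho}>0$ only depending on $\lambda,\delta,\rho$, such that

$$\mathbb{E}\Pi\Big(\epsilon(\mathcal{Z}_\tau)>(1+\delta_1)\epsilon(\mathcal{Z}_{\tau^*})+\delta_1\|\mathscr{X}_{Z^*}(Q^*)-\theta^*\|^2\,\Big|\,Y\Big)\le\exp\Big(-C'\big(\epsilon(\mathcal{Z}_{\tau^*})+\|\mathscr{X}_{Z^*}(Q^*)-\theta^*\|^2\big)\Big) \tag{7}$$

and

$$\mathbb{E}\Pi\Big(\|\mathscr{X}_Z(Q)-\theta^*\|^2>(1+\delta_2)\|\mathscr{X}_{Z^*}(Q^*)-\theta^*\|^2+M\epsilon(\mathcal{Z}_{\tau^*})\,\Big|\,Y\Big)\le\exp\Big(-C''\big(\epsilon(\mathcal{Z}_{\tau^*})+\|\mathscr{X}_{Z^*}(Q^*)-\theta^*\|^2\big)\Big) \tag{8}$$

for any constant $D>D_{\lambda,\delta,\rho}$ with $\delta_1=\delta$, $\delta_2=8\sqrt{\delta/\rho}$ and some constants $M,C',C''$ only
depending on $\lambda,\delta,\rho,D$.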


Remark 4.1. The results of Theorem 4.1 hold for all $\epsilon(\mathcal{Z}_\tau)$ satisfying (4). By choosing $\epsilon(\mathcal{Z}_\tau)$
of the same order as (1), we obtain the rate $\ell(\mathcal{Z}_{\tau^*})+\log|\mathcal{Z}_{\tau^*}|$ under the posterior distribution.
From now on, we refer to both (1) and $\epsilon(\mathcal{Z}_\tau)$ as the complexity function.

Remark 4.2. By scrutinizing the proof of Theorem 4.1, the assumption (6) can be weakened.
In fact, we only require $|\{\tau\in\mathcal{T}:t-1<\epsilon(\mathcal{Z}_\tau)\le t\}|\le at^b$ for arbitrary constants $a,b>0$
for the result of Theorem 4.1 to hold. However, the current form (6) is much simpler and is
sufficient for all the examples considered in the paper.


Theorem 4.1 contains two results of an oracle type, where $\mathscr{X}_{Z^*}(Q^*)$ is understood to be
the oracle model that best approximates the true signal $\theta^*$. The first result (7) shows that
the model complexity selected by the posterior distribution is not greater than the sum of the
complexity of the oracle and a model misspecification term quantified by $\|\mathscr{X}_{Z^*}(Q^*)-\theta^*\|^2$.
The second result (8) is a posterior oracle inequality for the squared error loss $\|\mathscr{X}_Z(Q)-\theta^*\|^2$.
Compared with that of the oracle $\mathscr{X}_{Z^*}(Q^*)$, the squared error loss of $\mathscr{X}_Z(Q)$ has an extra
term proportional to $\epsilon(\mathcal{Z}_{\tau^*})$. It is worth noting that the constant $(1+\delta_2)$ in (8) can be
arbitrarily close to 1, as long as $D$ is chosen sufficiently large. Since our procedure involves
a model selection step, an oracle inequality with constant exactly 1 is impossible, which is
implied by a counter-example in [45] for sparse linear regression. Besides, we do not impose
any assumption on the operator $\mathscr{X}_Z(\cdot)$ except its linearity (2). In the regression model, this
means the results are assumption-free for the design matrix.

When the model is well specified in the sense that $\theta^*=\mathscr{X}_{Z^*}(Q^*)$, Theorem 4.1 reduces
to the following results on posterior contraction.


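Corollary 4.1. Assume (4), (5) and (6). For any $\theta^*=\mathscr{X}_{Z^*}(Q^*)$ with any $Z^*\in\bar{\mathcal{Z}}_{\tau^*}$,
any $\tau^*\in\mathcal{T}$, any $Q^*\in\mathbb{R}^{\ell(\mathcal{Z}_{\tau^*})}$, any constants $\lambda,\rho>0$ and any sufficiently small constant
$\delta\in(0,1)$, there exists some constant $D_{\lambda,\delta,\rho}>0$ only depending on $\lambda,\delta,\rho$, such that

$$\mathbb{E}\Pi\Big(\epsilon(\mathcal{Z}_\tau)>(1+\delta)\epsilon(\mathcal{Z}_{\tau^*})\,\Big|\,Y\Big)\le\exp\left(-C'\epsilon(\mathcal{Z}_{\tau^*})\right)$$

and

$$\mathbb{E}\Pi\Big(\|\mathscr{X}_Z(Q)-\theta^*\|^2>M\epsilon(\mathcal{Z}_{\tau^*})\,\Big|\,Y\Big)\le\exp\left(-C''\epsilon(\mathcal{Z}_{\tau^*})\right)$$

for any constant $D>D_{\lambda,\delta,\rho}$ with some constants $M,C',C''$ only depending on $\lambda,\delta,\rho,D$.

Therefore, the posterior contraction rate under the squared error loss is $\epsilon(\mathcal{Z}_{\tau^*})$, which can
be taken at the order of $\ell(\mathcal{Z}_{\tau^*})+\log|\mathcal{Z}_{\tau^*}|$. As we are going to show in the next section, the
rate is minimax optimal for all the examples considered in the paper.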


5 Applications 

5.1 Stochastic block model 

The stochastic block model was proposed by [28] to model random graphs with a community
structure. Given a symmetric adjacency matrix $A=A^T\in\{0,1\}^{n\times n}$ that codes an undirected
network with no self loops in the sense that $A_{ii}=0$ for all $i\in[n]$, the stochastic block model
assumes $\{A_{ij}\}_{i>j}$ are independent Bernoulli random variables with mean $\theta_{ij}=Q_{z(i)z(j)}\in[0,1]$
with some matrix $Q\in[0,1]^{k\times k}$ and some label vector $z\in[k]^n$. In other words, the
probability that there is an edge between the $i$th and the $j$th nodes only depends on their
community labels $z(i)$ and $z(j)$. Recently, the problem of estimating the success matrix $\theta$
has received much attention. The minimax rate of estimating $\theta$ under the Frobenius norm was
established by [23]. However, the upper bound in [23] was achieved by a procedure assuming
knowledge of the true number of communities $k^*$, and is not adaptive. The Bayes framework
proposed in this paper provides a natural solution to adaptive estimation for the stochastic block
model.

Let us write the stochastic block model in a general form as $\theta_{ij}=[\mathscr{X}_Z(Q)]_{ij}=Q_{z(i)z(j)}$
for all $i\ne j$. We do not need to model the diagonal entries because $A_{ii}=0$ for all $i\in[n]$
by convention. Then, $Z=z$, $\tau=k$, $\mathcal{T}=[n]$ and $\mathcal{Z}_k=[k]^n$. Though the true parameter $Q^*$
is symmetric, we do not impose symmetry for the prior distribution. Hence, $\ell(\mathcal{Z}_k)=k^2$ and
(4) is satisfied with $\epsilon(\mathcal{Z}_k)=k^2+n\log k$. The general prior distribution $\Pi$ can be specialized
to this case as

1. Sample $k\sim\pi$ from $[n]$, where $\pi(k)\propto\frac{\Gamma(k^2)}{\Gamma(k^2/2)}\exp\left(-D(k^2+n\log k)\right)$;

2. Conditioning on $k$, sample $z$ uniformly from $[k]^n$;

3. Conditioning on $(k,z)$, sample $Q\sim f_{k,z,\lambda}$, where $f_{k,z,\lambda}(Q)\propto\exp\Big(-\lambda\sqrt{\sum_{i\ne j}Q_{z(i)z(j)}^2}\Big)$;

4. Set $\theta_{ij}=Q_{z(i)z(j)}$ for all $i\ne j$ and $\theta_{ii}=0$ for all $i\in[n]$.

Note that in Step 2, we use $\mathcal{Z}_k=[k]^n$ instead of $\bar{\mathcal{Z}}_k$. This is because $(Q_1)_{z(i)z(j)}=(Q_2)_{z(i)z(j)}$
for all $i\ne j$ implies $Q_1=Q_2$, and thus $\bar{\mathcal{Z}}_k=\mathcal{Z}_k=[k]^n$. To better understand the density
function $f_{k,z,\lambda}$, consider the case where $n/k$ is an integer and the community sizes
$|\{i\in[n]:z(i)=u\}|=n/k$ are equal for all $u\in[k]$. Then $f_{k,z,\lambda}(Q)\propto e^{-\lambda(n/k)\|Q\|_F}$ if we also include the
diagonal entries. The exponent of the general form of $f_{k,z,\lambda}$ involves a weighted norm of $Q$
depending on the community sizes.

To study the posterior distribution, let us assume that the adjacency matrix is generated
by the true mean $\theta^*_{ij}=Q^*_{z^*(i)z^*(j)}\in[0,1]$ for $i\ne j$ and $\theta^*_{ii}=0$ for all $i\in[n]$.
Assume $z^*\in[k^*]^n$ for some $k^*\in[n]$. It is easy to see that the noise $W=A-\theta^*$ satisfies
(5) for some constant $\rho>0$ by Hoeffding's inequality. Moreover, the complexity function
$\epsilon(\mathcal{Z}_k)=k^2+n\log k$ satisfies (6). Hence, Corollary 4.1 can be specialized for the stochastic
block model.


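Corollary 5.1. For any $\theta^*$ and $k^*$ specified above, any constant $\lambda>0$ and any sufficiently
small constant $\delta\in(0,1)$, there exists some constant $D_{\lambda,\delta}>0$ only depending on $\lambda,\delta$ such
that

$$\mathbb{E}\Pi\Big(k^2+n\log k>(1+\delta)\big((k^*)^2+n\log k^*\big)\,\Big|\,A\Big)\le\exp\Big(-C'\big((k^*)^2+n\log k^*\big)\Big)$$

and

$$\mathbb{E}\Pi\Big(\|\theta-\theta^*\|_F^2>M\big((k^*)^2+n\log k^*\big)\,\Big|\,A\Big)\le\exp\Big(-C''\big((k^*)^2+n\log k^*\big)\Big)$$

for any constant $D>D_{\lambda,\delta}$ with some constants $M,C',C''$ only depending on $\lambda,\delta,D$.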






To the best of our knowledge, this is the first Bayes estimator for the stochastic block model
with theoretical justification. The posterior contraction rate is $(k^*)^2+n\log k^*$. According
to [23], this is the minimax rate of the problem. When $k^*\le\sqrt{n\log n}$, the rate is dominated
by $n\log k^*$, which grows only logarithmically as $k^*$ grows. When $k^*>\sqrt{n\log n}$, the rate is
dominated by $(k^*)^2$, corresponding to the number of parameters. Since posterior contraction
implies the existence of a point estimator with the same rate [25], the posterior mean is
automatically a rate-optimal adaptive estimator.
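The two regimes of the rate, and condition (6) for the complexity function $k^2+n\log k$, can be checked numerically with a short sketch. The function names and the values $n=1000$, $k=5$ and $k=200$ are illustrative assumptions.

```python
import math

def sbm_complexity(n, k):
    """Complexity k^2 + n log k of a k-community model on n nodes."""
    return k * k + n * math.log(k)

def dominant_term(n, k):
    """Report which of the two terms in k^2 + n log k is larger."""
    return "parameters" if k * k > n * math.log(k) else "labels"

n = 1000
small_k, large_k = 5, 200

# Condition (6): at most t indices k fall in each window (t - 1, t].
counts = {}
for k in range(1, n + 1):
    t = math.ceil(sbm_complexity(n, k))
    counts[t] = counts.get(t, 0) + 1
condition_6_holds = all(c <= t for t, c in counts.items())
```

With $n=1000$ the threshold $\sqrt{n\log n}$ is roughly 83, so $k=5$ falls in the regime where the label term $n\log k$ is the larger one and $k=200$ in the regime where the parameter count $k^2$ is.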


5.2 Biclustering 

The biclustering model, which originated in [26], can be viewed as an asymmetric extension of the
stochastic block model. The data matrix $Y\in\mathbb{R}^{n\times m}$ is assumed to be generated by a signal
matrix $\theta=(\theta_{ij})$ with form $\theta_{ij}=Q_{z_1(i)z_2(j)}$ for some label vectors $z_1\in[k]^n$ and $z_2\in[l]^m$. In
other words, the rows of $\theta$ have $k$ clusters and the columns of $\theta$ have $l$ clusters. The values
of $\{\theta_{ij}\}$ that belong to the same row-cluster and the same column-cluster are constant. The
goal is to recover the true signal matrix $\theta^*$ from the observation $Y$.

To put it in our general form, observe that $Z=(z_1,z_2)$, $\tau=(k,l)$, $\mathcal{T}=[n]\times[m]$,
$\mathcal{Z}_{k,l}=[k]^n\times[l]^m$ and $\ell(\mathcal{Z}_{k,l})=kl$. Moreover, the complexity function is $\epsilon(\mathcal{Z}_{k,l})=kl+n\log k+m\log l$,
which satisfies (4) and (6). The general prior $\Pi$ can be specialized to this case as

1. Sample $(k,l)\sim\pi$ from $[n]\times[m]$, where $\pi(k,l)\propto\frac{\Gamma(kl)}{\Gamma(kl/2)}\exp\left(-D(kl+n\log k+m\log l)\right)$;

2. Conditioning on $(k,l)$, sample $(z_1,z_2)$ uniformly from $[k]^n\times[l]^m$;

3. Conditioning on $(k,l,z_1,z_2)$, sample $Q\sim f_{k,l,z_1,z_2,\lambda}$ with $f_{k,l,z_1,z_2,\lambda}(Q)\propto\exp\Big(-\lambda\sqrt{\sum_{i\in[n],j\in[m]}Q_{z_1(i)z_2(j)}^2}\Big)$;

4. Set $\theta_{ij}=Q_{z_1(i)z_2(j)}$ for all $(i,j)\in[n]\times[m]$.

In Step 2, we use $\mathcal{Z}_{k,l}$ because $\bar{\mathcal{Z}}_{k,l}=\mathcal{Z}_{k,l}$ for the same reason as we have argued for the
stochastic block model. To analyze the posterior distribution, consider data $Y=\theta^*+W$,
where the signal $\theta^*$ admits a biclustering structure such that $\theta^*_{ij}=Q^*_{z_1^*(i)z_2^*(j)}$ for some $Q^*\in\mathbb{R}^{k^*\times l^*}$
and $(z_1^*,z_2^*)\in[k^*]^n\times[l^*]^m$, and the noise $W$ is assumed to satisfy (5).


Corollary 5.2. For any $\theta^*$ and $(k^*,l^*)$ specified above, any constants $\lambda,\rho>0$ and any
sufficiently small constant $\delta\in(0,1)$, there exists some constant $D_{\lambda,\delta,\rho}>0$ only depending on
$\lambda,\delta,\rho$ such that

$$\mathbb{E}\Pi\Big(kl+n\log k+m\log l>(1+\delta)\big(k^*l^*+n\log k^*+m\log l^*\big)\,\Big|\,Y\Big)\le\exp\Big(-C'\big(k^*l^*+n\log k^*+m\log l^*\big)\Big)$$

and

$$\mathbb{E}\Pi\Big(\|\theta-\theta^*\|_F^2>M\big(k^*l^*+n\log k^*+m\log l^*\big)\,\Big|\,Y\Big)\le\exp\Big(-C''\big(k^*l^*+n\log k^*+m\log l^*\big)\Big)$$

for any constant $D>D_{\lambda,\delta,\rho}$ with some constants $M,C',C''$ only depending on $\lambda,\delta,\rho,D$.

The posterior contraction rate for recovering a signal matrix with a biclustering structure
is $k^*l^*+n\log k^*+m\log l^*$, which is minimax optimal according to [23]. To the best of our
knowledge, this is the first adaptive estimation result for biclustering with optimal rate.

5.3 Sparse linear regression 

Consider a regression problem with fixed design $X\beta$, where $X\in\mathbb{R}^{n\times p}$ and $\beta\in\mathbb{R}^p$. The
regression coefficient is assumed to be sparse so that $\beta^T=(\beta_S^T,0_{S^c}^T)$ for some $S\subset[p]$. Recovering
the mean vector $X\beta$ and the regression vector $\beta$ with a sparse prior has been considered
in [15]. However, the results of [15] imposed strong assumptions that are commonly used for
the Lasso estimator [7]. In this section, we show that the general prior distribution that we
propose in Section 3 leads to optimal posterior contraction rates with minimal assumptions.

First, we note that the sparse linear regression model is a special case of the general
structured linear model by letting $Z=S$, $\tau=s$, $\mathcal{T}=[p]$, $\mathcal{Z}_s=\{S\subset[p]:|S|=s\}$,
$\ell(\mathcal{Z}_s)=s$ and $Q=\beta_S$. Then, we have the representation $\mathscr{X}_Z(Q)=X_{*S}\beta_S=X\beta$. Since
$\log|\mathcal{Z}_s|=\log\binom{p}{s}\le s\log\frac{ep}{s}$, the complexity function $\epsilon(\mathcal{Z}_s)=2s\log\frac{ep}{s}$ satisfies the condition
(4). It is also easy to check that $\epsilon(\mathcal{Z}_s)$ satisfies (6). We specialize the general prior $\Pi$ in
Section 3 as follows.

1. Sample $s\sim\pi$ from $[p]$, where $\pi(s)\propto\frac{\Gamma(s)}{\Gamma(s/2)}\exp\left(-2Ds\log\frac{ep}{s}\right)$;

2. Conditioning on $s$, sample $S$ uniformly from $\{S\subset[p]:|S|=s,\det(X_{*S}^TX_{*S})>0\}$;

3. Conditioning on $(s,S)$, sample $\beta_S\sim f_{s,S,\lambda}$ with $f_{s,S,\lambda}(\beta_S)\propto e^{-\lambda\|X_{*S}\beta_S\|}$ and set $\beta_{S^c}=0$.

Note that in Step 1, we use $\epsilon(\mathcal{Z}_s)=2s\log\frac{ep}{s}$ instead of the exact form of $\ell(\mathcal{Z}_\tau)+\log|\mathcal{Z}_\tau|$
in the exponent for simplicity. In Step 2, we sample $S$ from the set $\bar{\mathcal{Z}}_s=\{S\subset[p]:|S|=s,\det(X_{*S}^TX_{*S})>0\}$.
In this way, the density $f_{s,S,\lambda}$ in Step 3 is not degenerate. Since
$X_{*S}\in\mathbb{R}^{n\times s}$, when $s>n$, we must have $\bar{\mathcal{Z}}_s=\emptyset$. Hence, we may also replace $\pi$ in Step 1
by its renormalized version supported on $[n]$. Furthermore, note that the exponent on the
density of $\beta_S$ is $-\lambda\|X_{*S}\beta_S\|$, compared to $-\lambda\|\beta_S\|_1$ in [15]. We let the prior depend on the
design matrix $X$ to obtain an assumption-free optimal posterior prediction rate. The idea of a
design-dependent prior was also employed by [40] in an empirical pseudo-Bayes framework.
Moreover, $e^{-\lambda\|X_{*S}\beta_S\|}$ has an exponential tail, which is capable of modeling a large regression
coefficient.
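The claim that the complexity function $2s\log\frac{ep}{s}$ dominates $s+\log\binom{p}{s}$, which is what condition (4) requires here, can be verified numerically. The sketch below does this for every support size; the function names and the choice $p=200$ are illustrative assumptions.

```python
import math

def log_num_models(p, s):
    """log |Z_s| = log C(p, s), the log-number of supports of size s."""
    return math.log(math.comb(p, s))

def complexity(p, s):
    """Complexity function 2 s log(e p / s) used in the model selection prior."""
    return 2.0 * s * math.log(math.e * p / s)

p = 200
# Check s + log C(p, s) <= 2 s log(e p / s), i.e. condition (4), for every s in [p].
condition_4_holds = all(
    s + log_num_models(p, s) <= complexity(p, s) + 1e-9 for s in range(1, p + 1)
)
```

The inequality splits into two parts: $\log\binom{p}{s}\le s\log\frac{ep}{s}$ is the standard binomial bound, and $s\le s\log\frac{ep}{s}$ holds because $\log\frac{ep}{s}\ge1$ whenever $s\le p$.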

The prior distribution involves a correction factor $\Gamma(s)/\Gamma(s/2)$ in the model selection step to
compensate for the normalizing constant of the elliptical Laplace distribution. Without this
factor, $\exp\left(-2Ds\log\frac{ep}{s}\right)$ is the common prior distribution on the model dimension used in
[45, 14, 22, 15, 40]. Since $\exp\left(-2Ds\log\frac{ep}{s}\right)$ is a decreasing function of $s$, it gives less weight
to more complex models. However, with the correction factor, this is not true because
$\pi(s)\propto\frac{\Gamma(s)}{\Gamma(s/2)}\exp\left(-2Ds\log\frac{ep}{s}\right)$ is not necessarily a decreasing function of $s$. For a fixed
$D>0$ and a sufficiently large $p$, we have $\pi(\sqrt{p})<\pi(p)$, which leads to a counter-intuitive prior modeling strategy.
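The loss of monotonicity can be seen in a small numerical experiment that compares the log prior weight at $s=\sqrt{p}$ and $s=p$, with and without the correction factor $\Gamma(s)/\Gamma(s/2)$. The function name and the values $p=10^4$, $D=1$ are illustrative assumptions; whether the corrected weight increases between these two points depends on how $D$ and $p$ compare.

```python
import math

def log_prior_weight(s, p, D, corrected=True):
    """Unnormalized log pi(s) for sparse regression.

    corrected=True includes the factor Gamma(s) / Gamma(s / 2);
    corrected=False gives the plain complexity prior exp(-2 D s log(e p / s)).
    """
    value = -2.0 * D * s * math.log(math.e * p / s)
    if corrected:
        value += math.lgamma(s) - math.lgamma(s / 2.0)
    return value

p, D = 10_000, 1.0
s_small, s_large = int(math.sqrt(p)), p

# Log-ratio of the weight at s = p to the weight at s = sqrt(p).
plain_gap = (log_prior_weight(s_large, p, D, corrected=False)
             - log_prior_weight(s_small, p, D, corrected=False))
corrected_gap = (log_prior_weight(s_large, p, D)
                 - log_prior_weight(s_small, p, D))
```

A negative `plain_gap` means the uncorrected prior favors the smaller model, while a positive `corrected_gap` means the corrected prior puts more mass on the full model than on the model of size $\sqrt{p}$ for these particular values.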

Let us proceed to specify the truth. That is, $Y=X\beta^*+W$ for some $\beta^*$ with support
$S^*$ and sparsity $|S^*|=s^*$. The noise vector is assumed to be sub-Gaussian in the sense of
(5). Without loss of generality, we may assume $S^*\in\bar{\mathcal{Z}}_{s^*}$. This is because if $X_{*S^*}$ is collinear
in the sense that $\det(X_{*S^*}^TX_{*S^*})=0$, there always exists a $\beta_1$ with support $S_1$ and sparsity
$s_1=|S_1|$ such that $X\beta^*=X\beta_1$ and $\det(X_{*S_1}^TX_{*S_1})>0$. We may simply redefine $(s^*,S^*)$ by
$(s_1,S_1)$.
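Corollary 5.3. For any $\beta^*$, $S^*\in\bar{\mathcal{Z}}_{s^*}$ and $s^*$ specified above, any constants $\lambda,\rho>0$ and any
sufficiently small constant $\delta\in(0,1)$, there exists some constant $D_{\lambda,\delta,\rho}>0$ only depending on
$\lambda,\delta,\rho$ such that

$$\mathbb{E}\Pi\Big(s>(1+\delta)s^*\,\Big|\,Y\Big)\le\exp\Big(-C's^*\log\frac{ep}{s^*}\Big) \tag{9}$$

and

$$\mathbb{E}\Pi\Big(\|X\beta-X\beta^*\|^2>Ms^*\log\frac{ep}{s^*}\,\Big|\,Y\Big)\le\exp\Big(-C''s^*\log\frac{ep}{s^*}\Big) \tag{10}$$

for any constant $D>D_{\lambda,\delta,\rho}$ with some constants $M,C',C''$ only depending on $\lambda,\delta,\rho,D$.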

The result (9) is implied by (7), which gives $s\log\frac{ep}{s}\le(1+\delta_1)s^*\log\frac{ep}{s^*}$ under the posterior
distribution. It improves the corresponding bounds in [14, 15] at a constant level. The result
(10) achieves the minimax optimal prediction rate with no assumption on the design matrix
$X$, which is comparable to the frequentist result in [8]. A slight improvement of (10) will be
discussed in Section 5.10.

Besides the optimal prediction rate, we are ready to obtain optimal estimation rates given
(9) and (10). Define

$$\kappa_2=\min_{\{b\ne0:\|b\|_0\le(2+\delta)s^*\}}\frac{\|Xb\|}{\sqrt{n}\|b\|} \quad\text{and}\quad \kappa_1=\min_{\{b\ne0:\|b\|_0\le(2+\delta)s^*\}}\frac{\sqrt{s^*}\|Xb\|}{\sqrt{n}\|b\|_1}. \tag{11}$$

Note that $\kappa_2$ is the restricted eigenvalue constant [12, 7] and $\kappa_1$ is the compatibility constant
[10].


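Corollary 5.4. Under the setting of Corollary 5.3, we have

$$\mathbb{E}\Pi\Big(\|\beta-\beta^*\|^2>\frac{Ms^*\log\frac{ep}{s^*}}{n\kappa_2^2}\,\Big|\,Y\Big)\le2\exp\Big(-(C'\wedge C'')s^*\log\frac{ep}{s^*}\Big)$$

and

$$\mathbb{E}\Pi\Big(\|\beta-\beta^*\|_1^2>\frac{M(s^*)^2\log\frac{ep}{s^*}}{n\kappa_1^2}\,\Big|\,Y\Big)\le2\exp\Big(-(C'\wedge C'')s^*\log\frac{ep}{s^*}\Big)$$

for the same constants $M,C',C''$ in Corollary 5.3.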


Compared with the minimax rates [18, 54], Corollary 5.4 obtains optimal estimation rates
for both $\ell_2$ and $\ell_1$ loss functions. Moreover, the dependence on the quantities $\kappa_2$ and $\kappa_1$ is
optimal [44], compared with the Lasso estimator and the spike and slab prior [15]. When
$\kappa\asymp\kappa_1\asymp\kappa_2$, the rates of Lasso depend on $\kappa$ through $\kappa^{-4}$ for both the loss $\|\cdot\|^2$ [7] and the loss
$\|\cdot\|_1^2$ [52], and the rates of the spike and slab prior depend on $\kappa$ through $\kappa^{-6}$ for the loss $\|\cdot\|^2$
and $\kappa^{-8}$ for the loss $\|\cdot\|_1^2$ [15], while we obtain the optimal dependence $\kappa^{-2}$ in Corollary 5.4.
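The passage from the prediction bound to the $\ell_2$ estimation bound can be spelled out in one line. The following is a sketch of the standard argument, written under the assumption that $\kappa_2$ is the restricted eigenvalue constant taken over nonzero vectors with at most $(2+\delta)s^*$ nonzero entries, and that $\beta$ denotes a draw from the posterior with support size $s$.

```latex
% On the event {s <= (1+\delta) s^*}, the difference b = \beta - \beta^* has
% \|b\|_0 <= s + s^* <= (2+\delta) s^*, so the definition of \kappa_2 applies to b.
\|\beta-\beta^*\|^2
  \;\le\; \frac{\|X(\beta-\beta^*)\|^2}{n\kappa_2^2}
  \;\le\; \frac{M s^*\log\frac{ep}{s^*}}{n\kappa_2^2},
% where the second inequality holds on the event
% {\|X\beta - X\beta^*\|^2 <= M s^* \log(ep/s^*)}.
```

A union bound over the complements of the two events, each of which has small expected posterior probability, then yields a bound with a leading factor 2 and the smaller of the two exponents. The $\ell_1$ bound follows the same pattern with the compatibility constant in place of $\kappa_2$.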

The results on convergence and model selection consistency for sparse linear regression
are not implied by the general theory. We are going to treat it separately in Section 6.

5.4 Linear regression with group sparsity

Let us consider a multiple regression setup $XB$ for $X\in\mathbb{R}^{n\times p}$ and $B\in\mathbb{R}^{p\times m}$. The matrix
$B$ collects regression coefficients from $m$ regression problems. We assume the $m$ regression
coefficients share the same support. That is, there is some $S\subset[p]$ such that $B_{S^c*}=0$. In
other words, $S$ indexes the nonzero rows of $B$. The concept of group sparsity was proposed by
[3, 58], and frequentist statistical properties were analyzed by [35].

To apply a Bayes procedure, let us write the problem in a general form by $Z=S$, $\tau=s$,
$\mathcal{T}=[p]$, $\mathcal{Z}_s=\{S\subset[p]:|S|=s\}$, $\ell(\mathcal{Z}_s)=ms$ and $Q=B_{S*}$. Then, we have the representation
$\mathscr{X}_Z(Q)=X_{*S}B_{S*}=XB$. The choice $\epsilon(\mathcal{Z}_s)=s\left(m+\log\frac{ep}{s}\right)$ satisfies the conditions (4) and
(6). The prior distribution $\Pi$ is similar to that used in Section 5.3.

1. Sample $s\sim\pi$ from $[p]$, where $\pi(s)\propto\frac{\Gamma(ms)}{\Gamma(ms/2)}\exp\left(-Ds\left(m+\log\frac{ep}{s}\right)\right)$;

2. Conditioning on $s$, sample $S$ uniformly from $\bar{\mathcal{Z}}_s=\{S\subset[p]:|S|=s,\det(X_{*S}^TX_{*S})>0\}$;

3. Conditioning on $(s,S)$, sample $B_{S*}\sim f_{s,S,\lambda}$ with $f_{s,S,\lambda}(B_{S*})\propto e^{-\lambda\|X_{*S}B_{S*}\|_F}$ and set
$B_{S^c*}=0$.


Note that we also use $\bar{\mathcal{Z}}_s$ in Step 2, as we did for sparse linear regression. Assume
the data is generated by $Y=XB^*+W$ for some matrix $B^*$ with support $S^*$ and sparsity
$s^*$. Again, without loss of generality, we assume $S^*\in\bar{\mathcal{Z}}_{s^*}$. The noise matrix $W$ is assumed
to be sub-Gaussian in the sense of (5).


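Corollary 5.5. For any $B^*$, $S^* \in \bar{\mathcal{Z}}_{s^*}$ and $s^*$ specified above, any constants $\lambda, \rho > 0$ and any sufficiently small constant $\delta \in (0,1)$, there exists some constant $D_{\lambda,\delta,\rho} > 0$ only depending on $\lambda, \delta, \rho$ such that

$$\mathbb{E}\,\Pi\left(s > (1+\delta)s^* \,\middle|\, Y\right) \le \exp\left(-C's^*\left(m + \log\frac{ep}{s^*}\right)\right)$$

and

$$\mathbb{E}\,\Pi\left(\|XB - XB^*\|_F^2 > Ms^*\left(m + \log\frac{ep}{s^*}\right) \,\middle|\, Y\right) \le \exp\left(-C''s^*\left(m + \log\frac{ep}{s^*}\right)\right)$$

for any constant $D > D_{\lambda,\delta,\rho}$ with some constants $M, C', C''$ only depending on $\lambda, \delta, \rho, D$.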

The posterior contraction rate for the prediction loss is $s^*\left(m + \log\frac{ep}{s^*}\right)$, which is minimax optimal according to [35, 39]. Posterior contraction for various estimation loss functions can also be derived in a similar way as in Section 5.3, and we omit the details.


5.5 Multi-task learning 

Multi-task learning is another name for multiple linear regression in the form of $XB$ with $X \in \mathbb{R}^{n\times p}$ and $B \in \mathbb{R}^{p\times m}$. Compared with $m$ independent linear regression problems, a typical multi-task learning setting assumes some dependence structure among the columns of the coefficient matrix $B$. The group sparsity assumption considered in Section 5.4 is an example where the columns of $B$ share the same support. In this section, we assume a clustering structure among the columns of $B$. That is, $B_{*j} = Q_{*z(j)}$ for some $z \in [k]^m$ and $Q \in \mathbb{R}^{p\times k}$. In other words, the $m$ regression coefficient vectors are allowed to choose from $k$ possibilities. When the design $X$ is an identity matrix, it reduces to an ordinary clustering problem.

Let us write the multi-task learning problem in the general form. This can be done by letting $Z = z$, $\tau = k$, $\mathcal{T} = [m]$, $\mathcal{Z}_k = [k]^m$ and $\ell(\mathcal{Z}_k) = pk$. Moreover, we have the representation $[\mathscr{X}_Z(Q)]_{*j} = XQ_{*z(j)}$. The complexity function $\epsilon(\mathcal{Z}_k) = pk + m\log k$ satisfies the conditions (4) and (6). The general prior distribution $\Pi$ can be specialized to this case. Consider a full rank design matrix such that $\det(X^TX) > 0$.

1. Sample $k \sim \pi$ from $[m]$, where $\pi(k) \propto \frac{\Gamma(pk)}{\Gamma(pk/2)}\exp\left(-D(pk + m\log k)\right)$;

2. Conditioning on $k$, sample $z$ uniformly from $[k]^m$;

3. Conditioning on $(k,z)$, sample $Q \sim f_{k,z,\lambda}$ with $f_{k,z,\lambda}(Q) \propto \exp\left(-\lambda\sqrt{\sum_{j=1}^m\|XQ_{*z(j)}\|^2}\right)$;

4. Set $B_{*j} = Q_{*z(j)}$ for all $j \in [m]$.

Note that in Step 2, we use $\mathcal{Z}_k = [k]^m$ because $\bar{\mathcal{Z}}_k = \mathcal{Z}_k$, which is due to $\det(X^TX) > 0$. The full rank assumption on the design matrix implies $p \le n$. In fact, the assumption $\det(X^TX) > 0$ is without loss of generality, because whenever $\det(X^TX) = 0$, one can simply use a subset of the variables that are linearly independent without affecting the prediction error.

To state the result of posterior contraction, let us assume that the data is generated as $Y = XB^* + W$ for some matrix $B^*$ satisfying $B^*_{*j} = Q^*_{*z^*(j)}$ with some $Q^*$ and $z^* \in [k^*]^m$. The noise matrix is assumed to satisfy (5).


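Corollary 5.6. For any $B^*$ and $k^*$ specified above, any constants $\lambda, \rho > 0$ and any sufficiently small constant $\delta \in (0,1)$, there exists some constant $D_{\lambda,\delta,\rho} > 0$ only depending on $\lambda, \delta, \rho$ such that

$$\mathbb{E}\,\Pi\left(pk + m\log k > (1+\delta)(pk^* + m\log k^*) \,\middle|\, Y\right) \le \exp\left(-C'(pk^* + m\log k^*)\right)$$

and

$$\mathbb{E}\,\Pi\left(\|XB - XB^*\|_F^2 > M(pk^* + m\log k^*) \,\middle|\, Y\right) \le \exp\left(-C''(pk^* + m\log k^*)\right)$$

for any constant $D > D_{\lambda,\delta,\rho}$ with some constants $M, C', C''$ only depending on $\lambda, \delta, \rho, D$.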


The posterior contraction rate for multi-task learning is $pk^* + m\log k^*$, which is smaller than the rate $pm$ for $m$ independent linear regressions. When $m\log k^* \lesssim pk^*$, the rate becomes $pk^* + m\log k^* \asymp pk^*$. In this case, the procedure performs as well as when the clustering structure $z^*$ is known. According to [38], the rate $pk^* + m\log k^*$ is minimax optimal.
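To make the comparison between the two rates concrete, the following small sketch evaluates $pk + m\log k$ against $pm$ for given dimensions. The function names and the numerical values in the usage note are hypothetical and serve only as an illustration; constants in front of the rates are ignored.

```python
import math

def multitask_rate(p, m, k):
    """Rate p*k + m*log(k) for m tasks whose coefficients form k clusters."""
    return p * k + m * math.log(k)

def independent_rate(p, m):
    """Rate p*m for m unrelated linear regressions with p covariates each."""
    return p * m
```

For instance, with `p = 50`, `m = 1000` and `k = 5`, `multitask_rate` is far below `independent_rate`, and with `k = 1` the rate reduces to `p`, the cost of a single regression.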
5.6 Dictionary learning

Dictionary learning can be viewed as a linear regression problem without knowing the design matrix. Mathematically, the signal matrix $\theta \in \mathbb{R}^{n\times d}$ can be represented as $\theta = QZ$ for some $Q \in \mathbb{R}^{n\times p}$ and $Z \in \mathbb{R}^{p\times d}$. Both the dictionary $Q$ and the coefficient matrix $Z$ are unknown. A common assumption is that each column of $Z$ is sparse, and the goal is to learn the latent sparse representation of the signal. Thus, the problem is also referred to as sparse coding [43]. Recently, the minimax rate of dictionary learning has been established by [38] for estimating the true signal matrix $\theta^*$. In this section, we provide a Bayes solution to the adaptive estimation problem of dictionary learning. Following [1], we consider a discrete version of the problem. Namely, $Z \in \{-1,0,1\}^{p\times d}$. Then, the problem can be represented in a general form by letting $\tau = (p,s)$, $\mathcal{T} = \{(p,s) \in [n\wedge d]\times[n] : s \le p\}$, $\mathcal{Z}_{p,s} = \{Z \in \{-1,0,1\}^{p\times d} : \max_{j\in[d]}|\mathrm{supp}(Z_{*j})| \le s\}$ and $\ell(\mathcal{Z}_{p,s}) = np$. Moreover, we have the representation $\mathscr{X}_Z(Q) = QZ$. The complexity function is $\ell(\mathcal{Z}_{p,s}) + \log|\mathcal{Z}_{p,s}| = np + d\left(\log\binom{p}{s} + 3\log s\right)$. With $\epsilon(\mathcal{Z}_{p,s}) = 3\left(np + ds\log\frac{ep}{s}\right)$, (4) and (6) are satisfied. The general prior distribution $\Pi$ can be specialized into the following sampling procedure.

1. Sample $(p,s) \sim \pi$ from $\mathcal{T}$ with $\pi(p,s) \propto \exp\left(-3D\left(np + ds\log\frac{ep}{s}\right)\right)$;

2. Given $(p,s)$, sample $Z$ uniformly from $\bar{\mathcal{Z}}_{p,s} = \{Z \in \mathcal{Z}_{p,s} : \det(ZZ^T) > 0\}$;

3. Given $(p,s,Z)$, sample $Q \sim f_{p,s,Z,\lambda}$ with $f_{p,s,Z,\lambda}(Q) \propto e^{-\lambda\|QZ\|_F}$;

4. Set $\theta = QZ$.
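Step 2 of the procedure above, drawing $Z$ uniformly from the full-row-rank matrices with entries in $\{-1,0,1\}$ and at most $s$ nonzeros per column, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the names `sample_column` and `sample_Z`, and the use of rejection sampling for the determinant constraint, are assumptions. Since the unconstrained set is a product over columns, each column is drawn uniformly by first choosing its support size $l$ with weight $\binom{p}{l}2^l$.

```python
import math
import numpy as np

def sample_column(p, s, rng):
    """Uniform draw from vectors in {-1,0,1}^p with at most s nonzeros."""
    # There are C(p, l) * 2^l vectors with exactly l nonzero entries.
    counts = np.array([math.comb(p, l) * 2.0 ** l for l in range(s + 1)])
    l = int(rng.choice(s + 1, p=counts / counts.sum()))
    col = np.zeros(p, dtype=int)
    support = rng.choice(p, size=l, replace=False)
    col[support] = rng.choice([-1, 1], size=l)
    return col

def sample_Z(p, s, d, rng=None, max_tries=10000):
    """Uniform draw of a p x d matrix Z with det(Z Z^T) > 0 by rejection."""
    rng = np.random.default_rng() if rng is None else rng
    for _ in range(max_tries):
        Z = np.column_stack([sample_column(p, s, rng) for _ in range(d)])
        if np.linalg.matrix_rank(Z) == p:     # equivalent to det(Z Z^T) > 0
            return Z
    raise RuntimeError("no full-row-rank Z found; is d >= p?")
```

Rejection keeps the draw uniform on the constrained set, at the price of repeated attempts when full row rank is unlikely (for example when $d$ is close to $p$).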


Note that we have used $\epsilon(\mathcal{Z}_{p,s}) = 3\left(np + ds\log\frac{ep}{s}\right)$ instead of the exact $\ell(\mathcal{Z}_\tau) + \log|\mathcal{Z}_\tau|$ in Step 1 for simplicity.

In order to state the posterior rate of contraction, we assume that the data is generated by $Y = \theta^* + W$ for some noise matrix $W$ satisfying (5). The signal $\theta^*$ is assumed to admit a sparse representation $\theta^* = Q^*Z^*$. Without loss of generality, we can always let the matrix $Z^*$ belong to the set $\bar{\mathcal{Z}}_{p^*,s^*}$. This is because when $\det(Z^*(Z^*)^T) = 0$, there must exist some $Q_1 \in \mathbb{R}^{n\times p_1}$ and $Z_1 \in \bar{\mathcal{Z}}_{p_1,s_1}$ such that $\theta^* = Q^*Z^* = Q_1Z_1$.

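Corollary 5.7. For any $\theta^* = Q^*Z^*$ with $Z^* \in \bar{\mathcal{Z}}_{p^*,s^*}$ specified above, any constants $\lambda, \rho > 0$ and any sufficiently small constant $\delta \in (0,1)$, there exists some constant $D_{\lambda,\delta,\rho} > 0$ only depending on $\lambda, \delta, \rho$ such that

$$\mathbb{E}\,\Pi\left(np + ds\log\frac{ep}{s} > (1+\delta)\left(np^* + ds^*\log\frac{ep^*}{s^*}\right) \,\middle|\, Y\right) \le \exp\left(-C'\left(np^* + ds^*\log\frac{ep^*}{s^*}\right)\right)$$

and

$$\mathbb{E}\,\Pi\left(\|\theta - \theta^*\|_F^2 > M\left(np^* + ds^*\log\frac{ep^*}{s^*}\right) \,\middle|\, Y\right) \le \exp\left(-C''\left(np^* + ds^*\log\frac{ep^*}{s^*}\right)\right)$$

for any constant $D > D_{\lambda,\delta,\rho}$ with some constants $M, C', C''$ only depending on $\lambda, \delta, \rho, D$.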
The rate we have obtained from Corollary 5.7 is $np^* + ds^*\log\frac{ep^*}{s^*}$, which is minimax optimal according to [38]. The part $ds^*\log\frac{ep^*}{s^*}$ is the error for recovering $d$ sparse regression coefficient vectors, and the part $np^*$ is the price that one has to pay for not knowing the design matrix $Q^*$. The result can be extended to the case where the entries of $Z^*$ are allowed to take values in an arbitrary discrete set with finite cardinality. To the best of our knowledge, this is the first adaptive estimation result for dictionary learning with optimal prediction rate.

5.7 Nonparametric graphon estimation 

Consider a random graph with adjacency matrix $\{A_{ij}\} \in \{0,1\}^{n\times n}$, whose sampling procedure is determined by

$$(\xi_1,\ldots,\xi_n) \sim P_\xi, \quad A_{ij}\mid(\xi_i,\xi_j) \sim \mathrm{Bernoulli}(\theta^*_{ij}), \quad\text{where } \theta^*_{ij} = f^*(\xi_i,\xi_j). \tag{12}$$

For $i \in [n]$, $A_{ii} = \theta^*_{ii} = 0$. Conditioning on $(\xi_1,\ldots,\xi_n)$, $A_{ij} = A_{ji}$ is independent across $i > j$. The function $f^*$ on $[0,1]^2$, which is assumed to be symmetric, is called a graphon. The concept of graphon originated from graph limit theory [29, 37, 17, 36] and the studies of exchangeable arrays [2, 31]. It is the underlying nonparametric object that generates the random graph.

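Let us proceed to specify the function class of graphons. Define the derivative operator by

$$\nabla_{jk}f(x,y) = \frac{\partial^{j+k}}{(\partial x)^j(\partial y)^k}f(x,y),$$

and we adopt the convention $\nabla_{00}f(x,y) = f(x,y)$. The Hölder norm is defined as

$$\|f\|_{\mathcal{H}_\alpha} = \max_{j+k\le\lfloor\alpha\rfloor}\sup_{x,y\in\mathcal{D}}|\nabla_{jk}f(x,y)| + \max_{j+k=\lfloor\alpha\rfloor}\sup_{(x,y)\neq(x',y')\in\mathcal{D}}\frac{|\nabla_{jk}f(x,y) - \nabla_{jk}f(x',y')|}{\|(x-x',y-y')\|^{\alpha-\lfloor\alpha\rfloor}},$$

where $\mathcal{D} = \{(x,y) \in [0,1]^2 : x \ge y\}$. Then, the graphon class with Hölder smoothness $\alpha$ is defined by

$$\mathcal{F}_\alpha(L) = \left\{0 \le f \le 1 : \|f\|_{\mathcal{H}_\alpha} \le L,\ f(x,y) = f(y,x) \text{ for all } (x,y) \in \mathcal{D}\right\},$$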

where $L > 0$ is the radius of the class, which is assumed to be a constant. Recently, a minimax optimal estimator of $f^*$ was proposed by [23] given the knowledge of $\alpha$. In this section, we solve the adaptive graphon estimation problem via a Bayes procedure.

As argued in [23], it is sufficient to approximate a graphon with Hölder smoothness by a piecewise constant function. In the random graph setting, a piecewise constant function is the stochastic block model. Therefore, we apply the prior distribution in Section 5.1 by equating $f(\xi_i,\xi_j) = \theta_{ij}$. The oracle inequality in Theorem 4.1 gives the desired bias-variance tradeoff of the problem.

Corollary 5.8. Consider the prior distribution specified in Section 5.1. For the class $\mathcal{F}_\alpha(L)$ with $\alpha, L > 0$ defined above and any constant $\lambda > 0$, there exists some constant $D_\lambda > 0$ only depending on $\lambda$ such that

$$\sup_{f^*\in\mathcal{F}_\alpha(L)}\sup_{P_\xi}\mathbb{E}\,\Pi\left(\frac{1}{n^2}\sum_{i,j\in[n]}\left(f(\xi_i,\xi_j) - f^*(\xi_i,\xi_j)\right)^2 > M\left(n^{-\frac{2\alpha}{\alpha+1}} + \frac{\log n}{n}\right) \,\middle|\, A\right) \le \exp\left(-C'\left(n^{\frac{2}{\alpha+1}} + n\log n\right)\right)$$

for any constant $D > D_\lambda$ with some constants $M, C'$ only depending on $\lambda, D, L$.


Remark 5.1. The expectation in Corollary 5.8 is associated with the joint distribution (12) over both $\{A_{ij}\}$ and $\{\xi_i\}$. Moreover, we make no assumption on the distribution of $\{\xi_i\}$, and the result of Corollary 5.8 holds uniformly over all $P_\xi$.

The posterior contraction rate we have obtained for graphon estimation is

$$n^{-\frac{2\alpha}{\alpha+1}} + \frac{\log n}{n},$$

which is minimax optimal according to [23]. When $\alpha \in (0,1)$, the rate is dominated by $n^{-\frac{2\alpha}{\alpha+1}}$, which is the typical two-dimensional nonparametric regression rate. When $\alpha \ge 1$, the rate becomes $\frac{\log n}{n}$, which does not depend on $\alpha$ anymore. The key difference between graphon estimation and nonparametric regression lies in the knowledge of the design sequence $\{\xi_i\}$. A nonparametric regression problem observes the pairs $\{(\xi_i,\xi_j), A_{ij}\}$, while graphon estimation only observes the adjacency matrix $\{A_{ij}\}$, resulting in an extra term $\frac{\log n}{n}$ in the rate. To the best of our knowledge, Corollary 5.8 is the first adaptive estimation result on graphon estimation with optimal convergence rate.
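The two regimes of the graphon rate can be checked numerically. The sketch below, with the hypothetical function name `graphon_rate`, returns the two terms $n^{-2\alpha/(\alpha+1)}$ and $(\log n)/n$ separately so that one can see which dominates; constants are ignored, so this is only an order-of-magnitude illustration.

```python
import math

def graphon_rate(n, alpha):
    """Return (total, nonparametric term, clustering term) of the rate."""
    nonparametric = n ** (-2.0 * alpha / (alpha + 1.0))
    clustering = math.log(n) / n
    return nonparametric + clustering, nonparametric, clustering
```

With `n = 1000`, a smoothness of `alpha = 0.5` makes the nonparametric term the larger one, while `alpha = 2` leaves the `log(n) / n` term dominant, in line with the discussion above.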


5.8 Linear regression under weak $\ell_q$ ball

Section 5.3 studied high dimensional linear regression under exact sparsity. In this section, we assume the regression coefficients are approximately sparse. Theorem 4.1 allows us to derive optimal posterior rates of contraction even when the prior only charges signals with exact sparsity, via a bias-variance tradeoff argument. Let us assume the data is generated by $Y = X\beta^* + W \in \mathbb{R}^n$ with some design $X \in \mathbb{R}^{n\times p}$ and some noise vector satisfying (5). We assume $\beta^*$ is approximately sparse by letting

$$\beta^* \in \mathcal{B}_q(k) = \left\{\beta \in \mathbb{R}^p : \max_{j\in[p]} j|\beta|_{(j)}^q \le k\right\}$$

with some $q \in [0,1]$, where we order the absolute values of the entries of $\beta$ by $|\beta|_{(1)} \ge |\beta|_{(2)} \ge \cdots \ge |\beta|_{(p)}$. Namely, $\beta^*$ is assumed to have weak $\ell_q$ radius at most $k$. To facilitate the presentation, we define the effective sparsity by $s^* = \lceil x^*\rceil$, where

$$x^* = \max\left\{0 \le x \le p : x \le k\left(\frac{n}{\log(ep/x)}\right)^{q/2}\right\}.$$

The effective sparsity $s^*$ is a function of $q, k, p, n$. Note that in the exact sparse case where $q = 0$, we have $s^* = k$. Let us use the prior distribution specified in Section 5.3, and we have the following result.
Corollary 5.9. Assume $\max_{j\in[p]} n^{-1/2}\|X_{*j}\| \le L$ for some constant $L > 0$. For any $q \in [0,1]$, $k$ and $s^*$ specified above and any constants $\lambda, \rho > 0$, there exists some constant $D_{\lambda,\rho} > 0$ only depending on $\lambda, \rho$ such that

$$\sup_{\beta^*\in\mathcal{B}_q(k)}\mathbb{E}\,\Pi\left(\|X\beta - X\beta^*\|^2 > Ms^*\log\frac{ep}{s^*} \,\middle|\, Y\right) \le \exp\left(-C's^*\log\frac{ep}{s^*}\right)$$

for any constant $D > D_{\lambda,\rho}$ with some constants $M, C'$ only depending on $\lambda, \rho, D, L$.


With $s^*$ being the effective sparsity, the posterior rate of contraction has the same form as that of Corollary 5.5. The rate is known to be minimax optimal [18, 44]. In the special case when $k \le p^{1-\eta}\left(\frac{\log p}{n}\right)^{q/2}$ for some constant $\eta \in (0,1)$, the rate has an explicit formula in terms of $k$, which is $s^*\log(ep/s^*) \asymp k\left(\frac{n}{\log p}\right)^{q/2}\log p$. When $X$ is an identity matrix, Corollary 5.9 reduces to the results for the sparse Gaussian sequence model in [14]. Besides the prediction error, estimation error under approximate sparsity can be derived in the same way as Corollary 5.4, and we omit this part due to the similarity.
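The effective sparsity has no closed form in general, but it is easy to evaluate numerically. The sketch below assumes the definition $x^* = \max\{0 \le x \le p : x \le k(n/\log(ep/x))^{q/2}\}$ and $s^* = \lceil x^*\rceil$; the function name `effective_sparsity`, the grid scan from $p$ downward and the bisection refinement are illustrative choices rather than anything prescribed by the text.

```python
import math

def effective_sparsity(q, k, p, n, grid=100000, iters=60):
    """Numerically approximate x* and s* = ceil(x*) for the weak l_q ball."""
    def ok(x):
        # Membership test for the set defining x*.
        return x <= k * (n / math.log(math.e * p / x)) ** (q / 2.0)

    if ok(float(p)):
        x_star = float(p)
    else:
        hi, lo = float(p), None
        # Scan downward for the largest grid point satisfying the constraint.
        for i in range(grid - 1, 0, -1):
            x = p * i / grid
            if ok(x):
                lo = x
                break
            hi = x
        if lo is None:
            return 0.0, 0
        # Refine the boundary between lo (feasible) and hi (infeasible).
        for _ in range(iters):
            mid = 0.5 * (lo + hi)
            if ok(mid):
                lo = mid
            else:
                hi = mid
        x_star = lo
    return x_star, math.ceil(x_star)
```

With `q = 0` the constraint reduces to `x <= k`, so the function returns `s* = k` for integer `k`, matching the exact sparse case.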


5.9 Wavelet estimation under Besov space 

In this section, we apply the general prior distribution in Section 3 to establish optimal Bayes wavelet estimation under Besov space. Assume the data is generated as

$$Y_{jk} = \theta^*_{jk} + \frac{1}{\sqrt{n}}W_{jk}, \quad k = 1,\ldots,2^j,\ j = 0,1,2,\ldots, \tag{13}$$

where $\{W_{jk}\}$ are i.i.d. $N(0,1)$ variables. It is well known that the sequence model is equivalent to the Gaussian white noise model [30], and it is closely related to nonparametric regression and density estimation [9, 42]. Under a wavelet basis, $\{\theta_{jk}\}$ are understood as wavelet coefficients. We assume the true signal $\theta^* = \{\theta^*_{jk}\}$ belongs to the Besov ball defined by

$$\Theta^\alpha_{p,q}(L) = \left\{\theta : \sum_j 2^{ajq}\|\theta_{j*}\|_p^q \le L^q\right\} \tag{14}$$

for some $p, q, \alpha, L > 0$ and $a = \alpha + \frac12 - \frac1p$. The Besov ball (14) naturally induces a multi-resolution structure of the signal. This inspires us to use a sparse prior distribution independently at each resolution level. That is, we consider a prior distribution $\Pi$ on $\theta$ satisfying

$$\Pi(d\theta) = \prod_j\Pi_j(d\theta_{j*}).$$

The prior distribution $\Pi_j$ on the $j$th level for $j \le \log_2 n$ is specified as follows:

1. Sample $s_j \sim \pi$ from $[2^j]$, where $\pi(s_j) \propto \exp\left(-Ds_j\log\frac{e2^j}{s_j}\right)$;

2. Conditioning on $s_j$, sample $S_j$ uniformly from $\{S_j \subset [2^j] : |S_j| = s_j\}$;

3. Conditioning on $(s_j,S_j)$, sample $\theta_{jS_j} \sim f_{s_j,S_j,\lambda}$ with $f_{s_j,S_j,\lambda}(\theta_{jS_j}) \propto e^{-\lambda\sqrt{n}\|\theta_{jS_j}\|}$ and set $\theta_{jS_j^c} = 0$.

For $j > \log_2 n$, let $\Pi_j(\theta_{j*} = 0) = 1$. Using Theorem 4.1 at each resolution level, we are able to establish the posterior contraction rate in the following corollary.


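Corollary 5.10. For any constants $p, q, \alpha$ satisfying $0 < p, q \le \infty$, $L > 0$ and $\alpha \ge \frac1p$ and any constant $\lambda > 0$, there exists some constant $D_\lambda$ only depending on $\lambda$ such that

$$\sup_{\theta^*\in\Theta^\alpha_{p,q}(L)}\mathbb{E}\,\Pi\left(\|\theta - \theta^*\|^2 > Mn^{-\frac{2\alpha}{2\alpha+1}} \,\middle|\, Y\right) \le \exp\left(-C'n^{\frac{1}{2\alpha+1}}/\log n\right)$$

for any $D > D_\lambda$ with some constants $M, C'$ only depending on $\lambda, D, \alpha, p, L$.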


The result of Corollary 5.10 can be regarded as a Bayes version of Theorem 12.1 of [30] under the same condition. The rate $n^{-\frac{2\alpha}{2\alpha+1}}$ is minimax optimal over the class $\Theta^\alpha_{p,q}(L)$. Posterior contraction for (13) over the class $\Theta^\alpha_{p,q}(L)$ has been investigated by [53, 47, 21, 27] only for a restricted configuration of $(p, q, \alpha)$. In comparison, Corollary 5.10 obtains adaptive optimal posterior contraction rates for all possible combinations of $(p, q, \alpha)$ considered in the frequentist literature [30].

When $p = q = 2$, the class $\Theta^\alpha_{2,2}(L)$ is equivalent to a Sobolev ball. It is worth noting that in this case the prior distribution can be greatly simplified. Let us recast (13) into the sequence model with a single index. That is, consider data generated by

$$Y_j = \theta^*_j + \frac{1}{\sqrt{n}}W_j, \quad j = 1,2,3,\ldots,$$

with $\{W_j\}$ being i.i.d. $N(0,1)$ variables. Assume the true signal $\theta^* = \{\theta^*_j\}$ belongs to the Sobolev ball defined by

$$\mathcal{S}_\alpha(L) = \left\{\theta : \sum_{j=1}^\infty j^{2\alpha}\theta_j^2 \le L^2\right\}.$$

We use the following version of the general prior $\Pi$ in Section 3.

1. Sample $k \sim \pi$ from $[n]$, where $\pi(k) \propto \exp(-Dk)$;

2. Conditioning on $k$, sample $\theta_{[k]} = (\theta_1,\ldots,\theta_k) \sim f_{k,\lambda}$ with $f_{k,\lambda}(\theta_{[k]}) \propto e^{-\lambda\sqrt{n}\|\theta_{[k]}\|}$ and set $\theta_j = 0$ for all $j > k$.

Note that the prior distribution has a missing step compared with the general prior in Section 3. This is because $\mathcal{Z}_k = \{[k]\}$ is a singleton, so that the model is determined by $k$ and we do not need to perform a further selection. Specializing Theorem 4.1 to this case, we obtain the following result.


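Corollary 5.11. For any constants $\alpha, L > 0$ and any constant $\lambda > 0$, there exists some constant $D_\lambda$ only depending on $\lambda$ such that

$$\sup_{\theta^*\in\mathcal{S}_\alpha(L)}\mathbb{E}\,\Pi\left(\|\theta - \theta^*\|^2 > Mn^{-\frac{2\alpha}{2\alpha+1}} \,\middle|\, Y\right) \le \exp\left(-C'n^{\frac{1}{2\alpha+1}}\right)$$

for any $D > D_\lambda$ with some constants $M, C'$ only depending on $\lambda, D, \alpha, L$.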
Thus, we have obtained rate-optimal adaptive posterior contraction over the Sobolev ball through a very simple prior distribution.

5.10 Aggregation 

Aggregation in nonparametric regression has been considered by [41, 50, 16, 56, 33] among others. Let us start with the nonparametric regression setting with fixed design. That is, the data is generated by

$$Y_i = f^*(x_i) + W_i, \quad i = 1,\ldots,n, \tag{15}$$

where the noise vector $W = \{W_i\}$ is assumed to satisfy (5). The goal of nonparametric regression is to estimate the true regression function $f^*$ by some estimator $\hat{f}$ under the loss

$$\|\hat{f} - f^*\|_n^2 = \frac1n\sum_{i=1}^n\left(\hat{f}(x_i) - f^*(x_i)\right)^2,$$

where $\|\cdot\|_n$ stands for the empirical $\ell_2$ norm. Assume we are given a collection of functions $\{f_1,\ldots,f_p\}$ called the dictionary, and we are also given a subset $\Theta \subset \mathbb{R}^p$. For $\beta \in \Theta$, define $f_\beta = \sum_{j=1}^p\beta_jf_j$. The goal of aggregation is to find an estimator $\hat{f}$ such that its error $\|\hat{f} - f^*\|_n^2$ is comparable to that given by the best among the class $\{f_\beta : \beta \in \Theta\}$. To be specific, one seeks an $\hat{f}$ that satisfies the following oracle inequality

$$\|\hat{f} - f^*\|_n^2 \le (1+\delta)\inf_{\beta\in\Theta}\|f_\beta - f^*\|_n^2 + \Delta_{n,p}(\Theta) \tag{16}$$

with high probability for some arbitrarily small constant $\delta \in (0,1)$ and some optimal rate function $\Delta_{n,p}(\Theta)$ determined by the class $\Theta$. Various types of aggregation problems include linear, convex and model selection aggregation, etc., which are determined by the choice of the class $\Theta$. In this section, we provide a single Bayes solution to various types of aggregation problems simultaneously and establish the oracle inequality (16) under the posterior distribution.

Since the vector $f_\beta = (f_\beta(x_1),\ldots,f_\beta(x_n))$ can be represented as $X\beta$ with the matrix $X$ having entries $X_{ij} = f_j(x_i)$ for all $(i,j) \in [n]\times[p]$, the aggregation problem can be recast as a linear regression problem. Define $r = \mathrm{rank}(X)$. Without loss of generality, we assume the first $r$ columns of $X$ span the column space of $X$. That is, $\mathrm{span}(\{X_{*j}\}_{j\in[r]}) = \mathrm{span}(\{X_{*j}\}_{j\in[p]})$. We are going to use a modified version of the prior distribution defined in Section 5.3.

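1. Sample $s \sim \pi$ from $[r]$, where $\pi(s) = \mathcal{N}\frac{\Gamma(s)}{\Gamma(s/2)}\exp\left(-Ds\log\frac{ep}{s}\right)$ for $s < r$ and $\pi(r) = \mathcal{N}\frac{\Gamma(r)}{\Gamma(r/2)}\exp(-Dr)$ with some normalizing constant $\mathcal{N}$;

2. Conditioning on $s$, sample $S$ uniformly from $\bar{\mathcal{Z}}_s = \{S \subset [p] : |S| = s, \det(X_{*S}^TX_{*S}) > 0\}$ if $s < r$ and set $S = [r]$ if $s = r$;

3. Conditioning on $(s,S)$, sample $\beta_S \sim f_{s,S,\lambda}$ with $f_{s,S,\lambda}(\beta_S) \propto e^{-\lambda\|X_{*S}\beta_S\|}$ and set $\beta_{S^c} = 0$.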
The prior $\pi$ is similar to the exponential weights used for sparsity pattern aggregation by [45, 46]. Compared with the prior in Section 5.3, it has a modified weight on the model $S = [r]$, which captures the intrinsic dimension of the matrix $X$. Assuming the data generating process (15), we have the following result implied by Theorem 4.1.


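Corollary 5.12. For any $\beta^*$ with support $S^* \in \bar{\mathcal{Z}}_{s^*}$ and sparsity $s^* = |S^*| \le r$, any $f^*$, any constants $\lambda, \rho > 0$ and any sufficiently small constant $\delta \in (0,1)$, there exists some constant $D_{\lambda,\delta,\rho}$ only depending on $\lambda, \delta, \rho$ such that

$$\mathbb{E}\,\Pi\left(\|f_\beta - f^*\|_n^2 > (1+\delta)\|f_{\beta^*} - f^*\|_n^2 + M\frac{r\wedge s^*\log(ep/s^*)}{n} \,\middle|\, Y\right) \le \exp\left(-C'\left(n\|f_{\beta^*} - f^*\|_n^2 + r\wedge s^*\log\frac{ep}{s^*}\right)\right)$$

for any constant $D > D_{\lambda,\delta,\rho}$ with some constants $M, C'$ only depending on $\lambda, \delta, \rho, D$.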


Since $\mathrm{rank}(X) = r$, it is sufficient to establish the posterior oracle inequality for all $\beta^*$ with sparsity $s^* \le r$. Due to the modified prior weight on the model $S = [r]$, Corollary 5.12 has a better convergence rate than Corollary 5.5. The corresponding frequentist results [45, 46] have leading constant 1 instead of the $(1+\delta)$ in Corollary 5.12. Since our prior has a subset selection step, the presence of an extra small constant $\delta$ cannot be avoided [45].

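Let us specialize Corollary 5.12 to various types of aggregation problems. Following the notation in [51], define the simplex $\Lambda^p = \{\beta \in \mathbb{R}^p : \sum_j\beta_j = 1, \beta_j \ge 0\}$ and the $\ell_0$ ball $\mathcal{B}_0(s^*) = \{\beta \in \mathbb{R}^p : |\mathrm{supp}(\beta)| \le s^*\}$. Then, we consider model selection aggregation $\Theta_{(MS)} = \mathcal{B}_0(1)\cap\Lambda^p$, convex aggregation $\Theta_{(C)} = \Lambda^p$, linear aggregation $\Theta_{(L)} = \mathbb{R}^p$, sparse aggregation $\Theta_{(L_s)} = \mathcal{B}_0(s^*)$ and sparse convex aggregation $\Theta_{(C_s)} = \mathcal{B}_0(s^*)\cap\Lambda^p$. For these aggregation problems, define the rate function

$$\Delta_{n,p}(\Theta) = \begin{cases} \dfrac{r\wedge\log p}{n}, & \Theta = \Theta_{(MS)}; \\[1ex] \sqrt{\dfrac1n\log\left(1 + \dfrac{p}{\sqrt{n}}\right)}, & \Theta = \Theta_{(C)}; \\[1ex] \dfrac{r}{n}, & \Theta = \Theta_{(L)}; \\[1ex] \dfrac{s^*\log\frac{ep}{s^*}}{n}, & \Theta = \Theta_{(L_s)}; \\[1ex] \sqrt{\dfrac1n\log\left(1 + \dfrac{p}{\sqrt{n}}\right)}\wedge\dfrac{s^*\log\frac{ep}{s^*}}{n}, & \Theta = \Theta_{(C_s)}. \end{cases}$$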

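Corollary 5.13. Assume $\max_{j\in[p]}\|f_j\|_n \le 1$. For any $f^*$, any $\Theta \in \{\Theta_{(MS)}, \Theta_{(C)}, \Theta_{(L)}, \Theta_{(L_s)}, \Theta_{(C_s)}\}$, any constants $\lambda, \rho > 0$ and any sufficiently small constant $\delta \in (0,1)$, there exists some constant $D_{\lambda,\delta,\rho}$ only depending on $\lambda, \delta, \rho$ such that

$$\mathbb{E}\,\Pi\left(\|f_\beta - f^*\|_n^2 > (1+\delta)\inf_{\beta\in\Theta}\|f_\beta - f^*\|_n^2 + M\left(\Delta_{n,p}(\Theta)\wedge\frac{r}{n}\right) \,\middle|\, Y\right) \le \exp\left(-C'n\left(\inf_{\beta\in\Theta}\|f_\beta - f^*\|_n^2 + \Delta_{n,p}(\Theta)\wedge\frac{r}{n}\right)\right)$$

for any constant $D > D_{\lambda,\delta,\rho}$ with some constants $M, C'$ only depending on $\lambda, \delta, \rho, D$.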
Corollary 5.13 provides a universal aggregation result with a single posterior distribution. The rate is minimax optimal according to [45, 55]. Bayes aggregation was recently studied by [57] under the model misspecification framework [32]. Corollary 5.13 is a stronger posterior oracle inequality under weaker assumptions compared with that of [57]. Other types of aggregation results such as $\ell_q$ aggregation can also be derived directly from Corollary 5.12. The details are omitted in this paper.


6 More results on sparse linear regression 


In this section, we provide some further results on posterior contraction rates for linear regression under the $\ell_\infty$ norm $\|\cdot\|_\infty$. First, let us consider the sparse linear regression setting $Y = X\beta + W$ in Section 5.3. Convergence under the $\ell_\infty$ norm requires stronger assumptions than convergence under the $\ell_2$ norm. Following [19, 34], we assume the mutual coherence condition:

$$n^{-1}X_{*j}^TX_{*j} = 1 \text{ for all } j \in [p] \quad\text{and}\quad \max_{j\neq k}n^{-1}|X_{*j}^TX_{*k}| \le \tau. \tag{17}$$

Assuming the data is generated by $Y = X\beta^* + W$ for some regression coefficient $\beta^*$ with sparsity $s^*$ and some noise vector $W$ satisfying (5), the posterior contraction under the $\ell_\infty$ norm by using the prior distribution specified in Section 5.3 is given in the following theorem.


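Theorem 6.1. For any $\tau > 0$ and any $\beta^*$ with sparsity $s^*$ satisfying $\tau s^* < 1/9$ and any constants $\lambda, \rho > 0$, there exists some constant $D_{\lambda,\rho} > 0$ only depending on $\lambda, \rho$ such that

$$\mathbb{E}\,\Pi\left(\|\beta - \beta^*\|_\infty > M\sqrt{\frac{\log p}{n}} \,\middle|\, Y\right) \le p^{-C'}$$

for any constant $D > D_{\lambda,\rho}$ with some constants $M, C'$ only depending on $\lambda, \rho, D$.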

The result of convergence under the $\ell_\infty$ norm is obtained under the assumption $\tau s^* < 1/9$. Such an assumption was also made in [19, 11, 34, 15], and it implies that the restricted eigenvalue $\kappa_2$ defined in (11) is bounded away from 0 [59]. The convergence rate is optimal under the $\ell_\infty$ norm. Moreover, with a standard minimal signal strength assumption, Theorem 6.1 immediately implies model selection consistency under the posterior distribution.
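Whether a given design meets the coherence requirement can be checked directly. The sketch below is an illustration under assumptions: the function names are hypothetical, and the columns are rescaled inside the function so that $n^{-1}X_{*j}^TX_{*j} = 1$, which the condition takes as given. It then computes the largest normalized off-diagonal inner product and tests $\tau s^* < 1/9$.

```python
import numpy as np

def mutual_coherence(X):
    """Largest |X_j^T X_k| / n over j != k after rescaling columns to norm sqrt(n)."""
    n = X.shape[0]
    Xn = X * np.sqrt(n) / np.linalg.norm(X, axis=0)   # assumes no zero column
    G = (Xn.T @ Xn) / n
    np.fill_diagonal(G, 0.0)                          # ignore the diagonal
    return float(np.max(np.abs(G)))

def coherence_condition_holds(X, s_star):
    """Check the sufficient condition tau * s_star < 1/9."""
    return mutual_coherence(X) * s_star < 1.0 / 9.0
```

An orthogonal design has coherence 0 and passes for every sparsity level, while a design with two proportional columns has coherence 1 and fails for every $s^* \ge 1$.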

While the optimal convergence result for the $\ell_\infty$ norm is well known in the frequentist literature for sparse linear regression, an analogous result for regression with group sparsity is perhaps still open. We provide a Bayes solution to this problem. For simplicity of presentation, we consider the case of identity design $Y = B + W \in \mathbb{R}^{p\times m}$, and the result for the case of a more general design can be derived in a similar way. For any subset $T \subset [p]\times[m]$, let $r(T) = \{i \in [p] : (\{i\}\times[m])\cap T \neq \emptyset\}$ denote the rows selected by the set $T$. The prior $\Pi$ we use is defined through the following sampling procedure.


1. Sample $T \sim \pi$ in $\{T : T \subset [p]\times[m]\}$ with

$$\pi(T) \propto \frac{\Gamma(|T|)}{\Gamma(|T|/2)}\exp\left(-D\left(m|r(T)| + |r(T)|\log\frac{ep}{|r(T)|} + |T|\log\frac{em|r(T)|}{|T|}\right)\right); \tag{18}$$

2. Conditioning on $T$, sample $B_T \sim f_{T,\lambda}$ with $f_{T,\lambda}(B_T) \propto e^{-\lambda\|B_T\|}$ and set $B_{T^c} = 0$.

Compared with the prior distribution specified in Section 5.4, the model selection step for the above prior has a two-level structure. Apart from the correction factor $\frac{\Gamma(|T|)}{\Gamma(|T|/2)}$, the probability mass (18) can be viewed as the product of $e^{-D|S|\left(m+\log\frac{ep}{|S|}\right)}$ and $e^{-D|T|\log\frac{em|S|}{|T|}}$ with $S = r(T)$ denoting the row support. Therefore, (18) can be understood as first picking a row support $S$, and then further selecting a finer support from $S\times[m]$. In comparison, the prior specified in Section 5.4 does not have the second step. While it only produces $B$ with support in the form of $S\times[m]$ for some $S$, (18) can give an arbitrary support $T$, which is critical to obtain the optimal convergence rate under the $\ell_\infty$ loss. Assuming the data is generated from $Y = B^* + W$ for some $B^*$ with row support $S^*$ and noise matrix $W$ satisfying (5), the posterior contraction rate is given in the following theorem.


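Theorem 6.2. For any $B^*$ with row support $S^*$ and sparsity $s^* = |S^*|$, any arbitrarily small constant $\delta > 0$ and any constants $\lambda, \rho > 0$, there exists some constant $D_{\lambda,\delta,\rho} > 0$ only depending on $\lambda, \delta, \rho$ such that

$$\mathbb{E}\,\Pi\left(|r(T)| > (1+\delta)s^* \,\middle|\, Y\right) \le \exp\left(-C's^*\left(m + \log\frac{ep}{s^*}\right)\right), \tag{19}$$

$$\mathbb{E}\,\Pi\left(\|B - B^*\|_F^2 > Ms^*\left(m + \log\frac{ep}{s^*}\right) \,\middle|\, Y\right) \le \exp\left(-C''s^*\left(m + \log\frac{ep}{s^*}\right)\right), \tag{20}$$

and

$$\mathbb{E}\,\Pi\left(\|B - B^*\|_\infty > M\sqrt{\log(p+m)} \,\middle|\, Y\right) \le (pm)^{-C'''} \tag{21}$$

for any constant $D > D_{\lambda,\delta,\rho}$ with some constants $M, C', C'', C'''$ only depending on $\lambda, \delta, \rho, D$.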

To the best of our knowledge, this is the first procedure that achieves the optimal rates simultaneously for both $\ell_2$ and $\ell_\infty$ losses. The $e^{-D|S|\left(m+\log\frac{ep}{|S|}\right)}$ part in (18) preserves the group sparse structure and results in the optimal $\ell_2$ result (20). The $e^{-D|T|\log\frac{em|S|}{|T|}}$ part in (18) does a further model selection at a finer resolution, thus giving the optimal rate for each coordinate in (21). The subtlety of the simultaneous adaptation under both global and local loss functions is not reflected in an ordinary sparsity setting. When $m = 1$, group sparsity reduces to ordinary sparsity and the two-level model selection prior $\Pi$ is equivalent to the prior in Section 5.3, so that a one-level model selection would be sufficient for the task.
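The two-level structure of the weight (18) can be made concrete by evaluating its logarithm for a given support. The sketch below is illustrative only: the function name `log_prior_mass` is hypothetical, the correction factor is taken to be $\Gamma(|T|)/\Gamma(|T|/2)$ (an assumption about the form of the factor in (18)), and the result is unnormalized.

```python
import math

def log_prior_mass(T, p, m, D=1.0):
    """Unnormalized log pi(T) for a nonempty set T of (row, column) pairs."""
    T = set(T)
    if not T:
        raise ValueError("T must be nonempty")
    t = len(T)                                  # |T|
    r = len({i for i, _ in T})                  # |r(T)|, number of selected rows
    correction = math.lgamma(t) - math.lgamma(t / 2.0)
    row_level = m * r + r * math.log(math.e * p / r)
    entry_level = t * math.log(math.e * m * r / t)
    return correction - D * (row_level + entry_level)
```

Two supports with the same number of entries receive different weights depending on how many rows they touch: placing both entries in one row is penalized less than spreading them over two rows, which is how the prior favors group-sparse structure while still allowing arbitrary supports.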


7 Proof of Theorem 4.1 


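Let us first introduce some notation and give the outline of the proof. Using the fact that

$$\frac{e^{-\frac12\|Y - \mathscr{X}_Z(Q)\|^2}}{e^{-\frac12\|Y - \mathscr{X}_{Z^*}(Q^*)\|^2}} = e^{-\frac12\|\mathscr{X}_Z(Q) - \mathscr{X}_{Z^*}(Q^*)\|^2 + \langle Y - \mathscr{X}_{Z^*}(Q^*),\ \mathscr{X}_Z(Q) - \mathscr{X}_{Z^*}(Q^*)\rangle},$$

we can rewrite the posterior distribution as

$$\Pi\left(\mathscr{X}_Z(Q) \in U \mid Y\right) = \frac{\sum_{\tau\in\mathcal{T}}\exp(-D\epsilon(\mathcal{Z}_\tau))\frac{1}{|\bar{\mathcal{Z}}_\tau|}\sum_{Z\in\bar{\mathcal{Z}}_\tau}R(Z,U)}{\sum_{\tau\in\mathcal{T}}\exp(-D\epsilon(\mathcal{Z}_\tau))\frac{1}{|\bar{\mathcal{Z}}_\tau|}\sum_{Z\in\bar{\mathcal{Z}}_\tau}R(Z)}, \tag{22}$$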
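where $R(Z,U)$ is defined by

$$R(Z,U) = \left(\frac{\lambda}{\sqrt{\pi}}\right)^{\ell(\mathcal{Z}_\tau)}\sqrt{\det(\mathscr{X}_Z^T\mathscr{X}_Z)}\int_{\mathscr{X}_Z(Q)\in U}e^{-\frac12\|\mathscr{X}_Z(Q) - \mathscr{X}_{Z^*}(Q^*)\|^2 + \langle Y - \mathscr{X}_{Z^*}(Q^*),\ \mathscr{X}_Z(Q) - \mathscr{X}_{Z^*}(Q^*)\rangle - \lambda\|\mathscr{X}_Z(Q)\|}\,dQ,$$

and $R(Z) = R(Z,\mathbb{R}^N)$. Moreover, for a class of structure indexes $A \subset \mathcal{T}$, its posterior distribution can be written as

$$\Pi(\tau \in A \mid Y) = \frac{\sum_{\tau\in A}\exp(-D\epsilon(\mathcal{Z}_\tau))\frac{1}{|\bar{\mathcal{Z}}_\tau|}\sum_{Z\in\bar{\mathcal{Z}}_\tau}R(Z)}{\sum_{\tau\in\mathcal{T}}\exp(-D\epsilon(\mathcal{Z}_\tau))\frac{1}{|\bar{\mathcal{Z}}_\tau|}\sum_{Z\in\bar{\mathcal{Z}}_\tau}R(Z)}. \tag{23}$$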


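We are going to work with the formulas (23) and (22) to prove (7) and (8), respectively. The main strategy is to lower bound $R(Z^*)$ in the denominator and upper bound $R(Z)$ or $R(Z,U)$ in the numerator given some events holding with high probability. For each $Z \in \bar{\mathcal{Z}}_\tau$, consider the following events

$$E_Z = \left\{|\langle W, \mathscr{X}_Z(Q) - \mathscr{X}_{Z^*}(Q^*)\rangle| \le \sqrt{\epsilon^*(\mathcal{Z}_\tau)}\,\|\mathscr{X}_Z(Q) - \mathscr{X}_{Z^*}(Q^*)\| \text{ for all } Q \in \mathbb{R}^{\ell(\mathcal{Z}_\tau)}\right\},$$

$$F_Z = \left\{|\langle W, \mathscr{X}_Z(Q) - \mathscr{X}_{Z^*}(Q^*)\rangle| \le \sqrt{\epsilon^*(\mathcal{Z}_{\tau^*})}\,\|\mathscr{X}_Z(Q) - \mathscr{X}_{Z^*}(Q^*)\| \text{ for all } Q \in \mathbb{R}^{\ell(\mathcal{Z}_\tau)}\right\},$$

where $\epsilon^*(\mathcal{Z}_\tau) = C_1\epsilon(\mathcal{Z}_\tau) + C_2\|\mathscr{X}_{Z^*}(Q^*) - \theta^*\|^2$ and $\epsilon^*(\mathcal{Z}_{\tau^*}) = C_1\epsilon(\mathcal{Z}_{\tau^*}) + C_2\|\mathscr{X}_{Z^*}(Q^*) - \theta^*\|^2$ for some constants $C_1, C_2$ to be specified later. The next lemma shows that both events hold with high probability.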


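Lemma 7.1. For any constants $C_1 \ge 1$ and $C_2 > 0$, the conditions (4) and (5) imply

$$\mathbb{P}(E_Z^c) \le 2\exp\left(-(\rho C_1/16 - 5)\epsilon(\mathcal{Z}_\tau) - \rho C_2\|\mathscr{X}_{Z^*}(Q^*) - \theta^*\|^2/16\right),$$

$$\mathbb{P}(F_Z^c) \le 2\exp\left(5\ell(\mathcal{Z}_\tau) - \rho C_1\epsilon(\mathcal{Z}_{\tau^*})/16 - \rho C_2\|\mathscr{X}_{Z^*}(Q^*) - \theta^*\|^2/16\right).$$

We also need a lemma to characterize the growth rate of $\epsilon(\mathcal{Z}_\tau)$.

Lemma 7.2. For any $\beta \ge 2$ and $a \ge 1$, the condition (6) implies

$$\sum_{\{\tau\in\mathcal{T}:\,\epsilon(\mathcal{Z}_\tau)\le a\}}\exp\left(\beta\epsilon(\mathcal{Z}_\tau)\right) \le 4\lceil a\rceil\exp(\beta\lceil a\rceil);$$

$$\sum_{\{\tau\in\mathcal{T}:\,\epsilon(\mathcal{Z}_\tau)>a\}}\exp\left(-\beta\epsilon(\mathcal{Z}_\tau)\right) \le 4a\exp(-\beta\lfloor a\rfloor);$$

$$\sum_{\{\tau\in\mathcal{T}:\,\epsilon(\mathcal{Z}_\tau)\le a\}}\exp\left(-\beta\epsilon(\mathcal{Z}_\tau)\right) \le 6.$$

The proofs of Lemma 7.1 and Lemma 7.2 are given in Section 9.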
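Lower bounding $R(Z^*)$. For $Z^* \in \bar{\mathcal{Z}}_{\tau^*}$ with any $\tau^* \in \mathcal{T}$, we lower bound $R(Z^*)$ by

$$\left(\frac{\sqrt{\pi}}{\lambda}\right)^{\ell(\mathcal{Z}_{\tau^*})}R(Z^*)$$

$$= \sqrt{\det(\mathscr{X}_{Z^*}^T\mathscr{X}_{Z^*})}\int e^{-\frac12\|\mathscr{X}_{Z^*}(Q) - \mathscr{X}_{Z^*}(Q^*)\|^2 + \langle Y - \mathscr{X}_{Z^*}(Q^*),\ \mathscr{X}_{Z^*}(Q) - \mathscr{X}_{Z^*}(Q^*)\rangle - \lambda\|\mathscr{X}_{Z^*}(Q)\|}\,dQ$$

$$= \sqrt{\det(\mathscr{X}_{Z^*}^T\mathscr{X}_{Z^*})}\int e^{-\frac12\|\mathscr{X}_{Z^*}(Q)\|^2 + \langle Y - \mathscr{X}_{Z^*}(Q^*),\ \mathscr{X}_{Z^*}(Q)\rangle - \lambda\|\mathscr{X}_{Z^*}(Q) + \mathscr{X}_{Z^*}(Q^*)\|}\,dQ \tag{24}$$

$$\ge e^{-\lambda\|\mathscr{X}_{Z^*}(Q^*)\|}\sqrt{\det(\mathscr{X}_{Z^*}^T\mathscr{X}_{Z^*})}\int e^{-\frac12\|\mathscr{X}_{Z^*}(Q)\|^2 + \langle Y - \mathscr{X}_{Z^*}(Q^*),\ \mathscr{X}_{Z^*}(Q)\rangle - \lambda\|\mathscr{X}_{Z^*}(Q)\|}\,dQ \tag{25}$$

$$= e^{-\lambda\|\mathscr{X}_{Z^*}(Q^*)\|}\int e^{-\frac12\|b\|^2 + \langle Y - \mathscr{X}_{Z^*}(Q^*),\ b\rangle - \lambda\|b\|}\,db \tag{26}$$

$$\ge e^{-\lambda\|\mathscr{X}_{Z^*}(Q^*)\|}\int e^{-\frac12\|b\|^2 - \lambda\|b\|}\,db\ \exp\left(\int\langle Y - \mathscr{X}_{Z^*}(Q^*),\ b\rangle\frac{e^{-\frac12\|b\|^2 - \lambda\|b\|}}{\int e^{-\frac12\|b\|^2 - \lambda\|b\|}\,db}\,db\right) \tag{27}$$

$$= e^{-\lambda\|\mathscr{X}_{Z^*}(Q^*)\|}\int e^{-\frac12\|b\|^2 - \lambda\|b\|}\,db. \tag{28}$$

The equalities (24) and (26) are due to changes of variables and the linearity (2). We also use the triangle inequality and Jensen's inequality to derive (25) and (27), respectively. The last equality (28) uses the fact that the distribution with density $\frac{e^{-\frac12\|b\|^2 - \lambda\|b\|}}{\int e^{-\frac12\|b\|^2 - \lambda\|b\|}\,db}$ is spherically symmetric, so that its mean is zero. Finally, let us lower bound the integral $\int e^{-\frac12\|b\|^2 - \lambda\|b\|}\,db$ by

$$\int e^{-\frac12\|b\|^2 - \lambda\|b\|}\,db = \frac{2\pi^{\ell(\mathcal{Z}_{\tau^*})/2}}{\Gamma(\ell(\mathcal{Z}_{\tau^*})/2)}\int_0^\infty r^{\ell(\mathcal{Z}_{\tau^*})-1}e^{-\frac12r^2 - \lambda r}\,dr$$

$$\ge \frac{2\pi^{\ell(\mathcal{Z}_{\tau^*})/2}}{\Gamma(\ell(\mathcal{Z}_{\tau^*})/2)}e^{-\frac12\ell(\mathcal{Z}_{\tau^*}) - \lambda\sqrt{\ell(\mathcal{Z}_{\tau^*})}}\int_0^{\sqrt{\ell(\mathcal{Z}_{\tau^*})}}r^{\ell(\mathcal{Z}_{\tau^*})-1}\,dr$$

$$= \frac{2\pi^{\ell(\mathcal{Z}_{\tau^*})/2}\left[\ell(\mathcal{Z}_{\tau^*})\right]^{\ell(\mathcal{Z}_{\tau^*})/2}}{\ell(\mathcal{Z}_{\tau^*})\,\Gamma(\ell(\mathcal{Z}_{\tau^*})/2)}e^{-\frac12\ell(\mathcal{Z}_{\tau^*}) - \lambda\sqrt{\ell(\mathcal{Z}_{\tau^*})}}$$

$$\ge \frac{2(2\pi)^{\ell(\mathcal{Z}_{\tau^*})/2}}{\ell(\mathcal{Z}_{\tau^*})}e^{-\frac12\ell(\mathcal{Z}_{\tau^*}) - \lambda\sqrt{\ell(\mathcal{Z}_{\tau^*})}}.$$

Combining the above lower bound with (28), we reach the conclusion

$$R(Z^*) \ge e^{-\lambda\|\mathscr{X}_{Z^*}(Q^*)\| - (1+\lambda+\lambda^{-1})\ell(\mathcal{Z}_{\tau^*})}. \tag{29}$$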


Note that (29) is a deterministic lower bound for the denominator R(Z*). The arguments 
we have used to derive (29) are greatly inspired by the corresponding ones in [14, 15]. 


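Upper bounding $R(Z)\mathbb{I}_{E_Z}$. To facilitate the analysis, we introduce the object

$$Q_Z = \mathop{\mathrm{argmin}}_{Q\in\mathbb{R}^{\ell(\mathcal{Z}_\tau)}}\|\mathscr{X}_Z(Q) - \mathscr{X}_{Z^*}(Q^*)\|^2. \tag{30}$$

The property of least squares implies the following Pythagorean identity,

$$\|\mathscr{X}_Z(Q) - \mathscr{X}_{Z^*}(Q^*)\|^2 = \|\mathscr{X}_Z(Q) - \mathscr{X}_Z(Q_Z)\|^2 + \|\mathscr{X}_Z(Q_Z) - \mathscr{X}_{Z^*}(Q^*)\|^2. \tag{31}$$

We first analyze the exponent in the definition of $R(Z)$ on the event $E_Z$ by

$$-\tfrac12\|\mathscr{X}_Z(Q) - \mathscr{X}_{Z^*}(Q^*)\|^2 + \langle Y - \mathscr{X}_{Z^*}(Q^*),\ \mathscr{X}_Z(Q) - \mathscr{X}_{Z^*}(Q^*)\rangle - \lambda\|\mathscr{X}_Z(Q)\|$$

$$= -\tfrac12\|\mathscr{X}_Z(Q) - \mathscr{X}_{Z^*}(Q^*)\|^2 + \langle W,\ \mathscr{X}_Z(Q) - \mathscr{X}_{Z^*}(Q^*)\rangle + \langle\theta^* - \mathscr{X}_{Z^*}(Q^*),\ \mathscr{X}_Z(Q) - \mathscr{X}_{Z^*}(Q^*)\rangle - \lambda\|\mathscr{X}_Z(Q)\|$$

$$\le -\tfrac12\|\mathscr{X}_Z(Q) - \mathscr{X}_{Z^*}(Q^*)\|^2 + \left(\sqrt{\epsilon^*(\mathcal{Z}_\tau)} + \lambda\right)\|\mathscr{X}_Z(Q) - \mathscr{X}_{Z^*}(Q^*)\| + \|\theta^* - \mathscr{X}_{Z^*}(Q^*)\|\,\|\mathscr{X}_Z(Q) - \mathscr{X}_{Z^*}(Q^*)\| - \lambda\|\mathscr{X}_Z(Q)\| - \lambda\|\mathscr{X}_Z(Q) - \mathscr{X}_{Z^*}(Q^*)\| \tag{32}$$

$$\le 2\left(\sqrt{\epsilon^*(\mathcal{Z}_\tau)} + \lambda\right)^2 - \left(\tfrac12 - \tfrac18\right)\|\mathscr{X}_Z(Q) - \mathscr{X}_{Z^*}(Q^*)\|^2 + 2\|\theta^* - \mathscr{X}_{Z^*}(Q^*)\|^2 + \tfrac18\|\mathscr{X}_Z(Q) - \mathscr{X}_{Z^*}(Q^*)\|^2 - \lambda\|\mathscr{X}_{Z^*}(Q^*)\| \tag{33}$$

$$\le (4 + 2/C_2)\epsilon^*(\mathcal{Z}_\tau) + 8\lambda^2 - \tfrac14\|\mathscr{X}_Z(Q) - \mathscr{X}_{Z^*}(Q^*)\|^2 - \lambda\|\mathscr{X}_{Z^*}(Q^*)\| \tag{34}$$

$$\le (4 + 2/C_2)\epsilon^*(\mathcal{Z}_\tau) + 8\lambda^2 - \tfrac14\|\mathscr{X}_Z(Q) - \mathscr{X}_Z(Q_Z)\|^2 - \lambda\|\mathscr{X}_{Z^*}(Q^*)\|. \tag{35}$$

We have used the Cauchy-Schwarz inequality and the event $E_Z$ to get (32). The inequality (33) is due to the fact $ab \le 2a^2 + b^2/8$ for all $a, b \ge 0$ and the triangle inequality. By rearrangement and the fact $C_2\|\theta^* - \mathscr{X}_{Z^*}(Q^*)\|^2 \le \epsilon^*(\mathcal{Z}_\tau)$, we obtain (34). Finally, the inequality (35) is by the identity (31). The above upper bound implies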

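$$R(Z)\mathbb{I}_{E_Z} \le \left(\frac{\lambda}{\sqrt{\pi}}\right)^{\ell(\mathcal{Z}_\tau)}e^{(4+2/C_2)\epsilon^*(\mathcal{Z}_\tau) + 8\lambda^2 - \lambda\|\mathscr{X}_{Z^*}(Q^*)\|}\times\sqrt{\det(\mathscr{X}_Z^T\mathscr{X}_Z)}\int e^{-\frac14\|\mathscr{X}_Z(Q) - \mathscr{X}_Z(Q_Z)\|^2}\,dQ$$

$$= \left(\frac{\lambda}{\sqrt{\pi}}\right)^{\ell(\mathcal{Z}_\tau)}e^{(4+2/C_2)\epsilon^*(\mathcal{Z}_\tau) + 8\lambda^2 - \lambda\|\mathscr{X}_{Z^*}(Q^*)\|}\int e^{-\frac14\|b\|^2}\,db$$

$$= (2\lambda)^{\ell(\mathcal{Z}_\tau)}e^{(4+2/C_2)\epsilon^*(\mathcal{Z}_\tau) + 8\lambda^2 - \lambda\|\mathscr{X}_{Z^*}(Q^*)\|}.$$

Using the fact that $\ell(\mathcal{Z}_\tau) \le \epsilon^*(\mathcal{Z}_\tau)$ by (4), we reach the conclusion

$$R(Z)\mathbb{I}_{E_Z} \le e^{(4+2/C_2+|\log(2\lambda)|)\epsilon^*(\mathcal{Z}_\tau) + 8\lambda^2 - \lambda\|\mathscr{X}_{Z^*}(Q^*)\|}. \tag{36}$$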


Upper bounding $R(Z,U)\mathbb{I}_{F_Z}$. Let us fix $U$ to be

$$U = \left\{\|\mathscr{X}_Z(Q) - \theta^*\|^2 > (1+\delta_2)\|\mathscr{X}_{Z^*}(Q^*) - \theta^*\|^2 + M\epsilon(\mathcal{Z}_{\tau^*})\right\}.$$

Let $\xi \in (0,1/4)$ be a constant to be specified later. When both $F_Z$ and $U$ hold, the exponent in the definition of $R(Z,U)$ is bounded by


-\\\&z(Q) - &z*(Q *)|| 2 + (Y- Q*), 3Fz{Q) - &z*(Q*)) - A||2Tz(Q)|| 

-\tWz{Q) - &z*(Q*)\\ 2 + {W,3T z {Q) - &z*(Q*)) + (0* - %- z *(Q*),%- z (Q) - & Z *(Q*)) 

—o- - m^z(Q) - % z *mt - aii ^ z ( q)ii 

< -\tW&z{Q) - & Z *m II 2 + (v/F(^) + A)II^Tz(Q) - ^z-(Q*)|| (37) 

(r - ^z*(Q*), ^z(Q) - ^ Z .(Q*)> - ^(1 - oil &z(Q) - & z *m || 2 

-A|| JT Z (Q) - & Z *{Q*) || - A|| JT Z (Q)|| 

< r 1 + a) 2 - \tWz{Q) - ^z-(Q*)|| 2 (38) 

-^(1 - OII-^(Q) - + ^(1 + Oll-^HQ*) - 6*\\ 2 + Z(&z{Q) - 0*,0* - & Z *(Q*)) 

—A||^z*(g*)|| 

< r 1 + a) 2 - ^ll^z(Q) - % Z *mt - A||^Tz*(Q*)|| (39) 

-^(1 - 20ll^z(Q) - e*w 2 + \{1 + 20ll^z-(Q*) - 0*ll 2 

< &5i l \ 2 -\Me{Z T *)-)-5 2 \\3F z *(Q*)-0*\\ 2 (40) 

o Z 

^ 2 || JT Z (Q) - ^z(Qz)|| 2 - A||^Tz*(Q*)ll- 

16 

We have used Cauchy-Schwarz inequality and the event Fz to get (37). The inequality (38) 
is due to the fact ab < a 2 + b 2 /4 for all a,b > 0 and triangle inequality. Then, (39) is by 
rearranging (38). Finally, we have set 

£ = and C 2 = ^5% (41) 

and used (31) to obtain (40) on the event U for all M > 64<5^ 1 C'i. Using the above bound, 
we have 


R(Z,U) I Fz < 


/ x \ t{Z T ) 

( _Z_ 1 (O -A||^X0*)||+85 2 - 1 A 2 -lM6(Z T *)-i52||«z*(Q*)-e*|| 2 

KV^J 


x^/det {SF^SCz) J e-TstoW^zM-sr^Q^W 2 d Q 


, ■, \ t{Z T ) 

4A \ e -A||S z ,(Q*)||+8«5 2 - 1 A 2 -iM e (Z T *)-^2||^.(Q*)-0*ll 2 

VhJ 


by the same argument in deriving (36). By l(Z T ) < e*(Z T ) from (4), we reach the conclusion 


R(Z,U) l Fz < e -M\^z*(Q*)\\-^Me{Z T *)-±6 2 \\%- z *(Q*)-6*\\ 2 


(42) 
for all $M > \max\{64\delta_2^{-1}C_1,\ 16\log(4\lambda/\sqrt{\delta_2}) + 128\delta_2^{-1}\lambda^2\}$.

After obtaining the bounds (29), (36) and (42), we are ready to prove the main results.
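The rearrangement between (38) and (39) appears to rest on an elementary identity. Writing $a = \mathscr{X}_Z(Q) - \theta^*$ and $b = \theta^* - \mathscr{X}_{Z^*}(Q^*)$ (shorthand introduced here, not in the original), one has $\langle b, a+b\rangle - \tfrac12(1-\xi)\|a+b\|^2 = -\tfrac12(1-\xi)\|a\|^2 + \tfrac12(1+\xi)\|b\|^2 + \xi\langle a, b\rangle$, and the cross term is then bounded by $\tfrac{\xi}{2}(\|a\|^2 + \|b\|^2)$. The following sketch checks both steps on random vectors; it assumes this reading of the displays and is not a substitute for the proof.

```python
import random


def dot(u, v):
    """Euclidean inner product of two equal-length lists."""
    return sum(x * y for x, y in zip(u, v))


def lhs(a, b, xi):
    """<b, a+b> - (1 - xi)/2 * ||a+b||^2."""
    d = [x + y for x, y in zip(a, b)]
    return dot(b, d) - 0.5 * (1.0 - xi) * dot(d, d)


def rhs(a, b, xi):
    """-(1 - xi)/2 * ||a||^2 + (1 + xi)/2 * ||b||^2 + xi * <a, b>."""
    return (-0.5 * (1.0 - xi) * dot(a, a)
            + 0.5 * (1.0 + xi) * dot(b, b)
            + xi * dot(a, b))


def upper(a, b, xi):
    """Bound after xi*<a,b> <= xi/2 * (||a||^2 + ||b||^2), valid for xi >= 0."""
    return -0.5 * (1.0 - 2.0 * xi) * dot(a, a) + 0.5 * (1.0 + 2.0 * xi) * dot(b, b)


if __name__ == "__main__":
    rng = random.Random(1)
    a = [rng.gauss(0, 1) for _ in range(5)]
    b = [rng.gauss(0, 1) for _ in range(5)]
    print(lhs(a, b, 0.1), rhs(a, b, 0.1), upper(a, b, 0.1))
```

The gap `upper - rhs` equals $\tfrac{\xi}{2}\|a - b\|^2$, which is why the $(1 \mp \xi)$ factors become $(1 \mp 2\xi)$ in (39).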

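Proof of (7). First, we use (29) and (36) to bound the ratio $R(Z)I_{E_Z}/R(Z^*)$:

\begin{align*}
\frac{R(Z) I_{E_Z}}{R(Z^*)} &\le e^{8\lambda^2}\, \frac{e^{(4C_1 + 2C_1/C_2 + C_1|\log(2\lambda)|)\epsilon(\mathcal{Z}_\tau) + (4C_2 + 2 + C_2|\log(2\lambda)|)\|\mathscr{X}_{Z^*}(Q^*) - \theta^*\|^2}}{e^{-(1+\lambda+\lambda^{-1})\ell(\mathcal{Z}_{\tau^*})}} \\
&\le e^{8\lambda^2} \exp\left((1 + \lambda + \lambda^{-1})\epsilon(\mathcal{Z}_{\tau^*}) + C_1'\epsilon(\mathcal{Z}_\tau) + C_2'\|\mathscr{X}_{Z^*}(Q^*) - \theta^*\|^2\right),
\end{align*}

where $C_1' = 4C_1 + 2C_1/C_2 + C_1|\log(2\lambda)|$ and $C_2' = 4C_2 + 2 + C_2|\log(2\lambda)|$. Let us use the
formula (23) with

$$A = \left\{\epsilon(\mathcal{Z}_\tau) > (1 + \delta_1)\epsilon(\mathcal{Z}_{\tau^*}) + \delta_1\|\mathscr{X}_{Z^*}(Q^*) - \theta^*\|^2\right\}.$$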

By Z* € Z T *, we have 


FTT /_ r AW} ^ ex P (~ De ( 2 r)) I Z T * | R(Z)I Ez 

mT€A]Y) s Se XP ( -^.)) W z §, E ^r 

+EE 


(43) 

(44) 


We use Lemma 7.2 to bound (43) by 

exp (8A 2 + (D + A + A" 1 + 1 )e(Z T *) + C' 2 \\& z * (Q*) ~ 0* || 2 ) exp (-(D - C[)e{Z T )) 


rS.4 


< 4e D+8A2 exp (- ((D - C\ - l)<5i - C' 2 ) || 5t? z *{Q*) ~ e*\\ 2 ) 
x exp (- ((D -C\- 1)(1 + Si) — (D + A + A" 1 + 1)) e(Z T *)) 


< 4e-° +8A2 exp 


44i|| x z .(Q-)~rf- s 2f e (z T .)), 


for D > max 


| a+a — +\+2(c 1 +i) , 2(C[ + 1) + ^ j. Using Lemma 7.1, Lemma 7.2 and (4), 


the second term (44) is bounded by 


2exp (-C 2 \\3r z *(Q*) - 0* || 2 /16) ^ exp (-(pCJ 16 - 6)e(Z r )) 

t£A 

< 8e 14 exp (-^11 3Tz*(Q*) ~ 9* || 2 - 7e(Z T *)) , 

for C\ = max{l, 224/p} and the value of C 2 is set in (41). Letting S 2 = 8-y/<5i /p = 8^5/p, 
we obtain the desired result by combining the bounds of (43) and (44). □ 

Proof of (8). Let us first use (29) and (42) to bound the ratio $R(Z, U)I_{F_Z}/R(Z^*)$. That is,
R{Z R ^ FZ < ex P (M/16 — (1 + A + A -1 )) e(Z T *) — ^S 2 \\& z *(Q*) — 0*|| 2 ^ 

< e W (-^e(Z T *)-^2\\%- z *(Q*)-0*\\ 2 
for $M > \max\{64\delta_2^{-1}C_1,\ 16\log(4\lambda/\sqrt{\delta_2}) + 128\delta_2^{-1}\lambda^2,\ 32(1 + \lambda + \lambda^{-1})\}$. By the formula (22),
we have


EU(U\Y) 


< 


E 

reT n A c 


exp (— De(Z r )) \Z T * 
exp (— De(Z r *)) \Z T \ 


v R(Z,U) I Fz 

2-* R(Z*) 

zez r 


(45) 


+ E E 1P ( F z ) ( 46 ) 

rer nA c zeZr 

+EII(r € A\Y) (47) 

The bound of (47) has been derived in the proof of (7). Using Lemma 7.2, we bound (45) by 

exp (- ~ d\ e(Z T *) - ^8 2 \\^z*(Q*) -0*|| 2 ) E exp (-U>e(Z T )) 

< 6exp (~^<Z T *) - ls 2 \\&z*m - r|| 2 ) , 

for $M > \max\{64\delta_2^{-1}C_1,\ 16\log(4\lambda/\sqrt{\delta_2}) + 128\delta_2^{-1}\lambda^2,\ 32(1 + \lambda + \lambda^{-1}),\ 64D\}$. Using Lemma
7.1, Lemma 7.2 and (4), the term (46) is bounded by

2exp (~pCie(Z T *)/16 - pC 2 \\3£z*(Q*) ~ #*|| 2 ) E exp (5 e(Z T )) 

reT n A c 

< 8e 6 exp (&- - 2 ) e(Z T *) - ( P C 2 - 6,) \\^(Q*) - *1 2 ) 

= 8e 6 exp (-12e(Z r *) - <5i|| 3?z*{Q*) ~ 0 *|| 2 ) , 

by the relation C 2 = 5 2 /32, C\ = max{l, 224/p} and 82 = 8y/5\/p = 8 \fbjp. The proof is 
complete by combining the bounds of (45), (46) and (47). □ 


8 Proofs of corollaries 

Proofs of Corollary 4.1 and Corollaries 5.1-5.7. Corollary 4.1 is a direct consequence of Theorem
4.1 by letting $\theta^* = \mathscr{X}_{Z^*}(Q^*)$. Except Corollary 5.4, Corollaries 5.1-5.7 are special
cases of Corollary 4.1 in different model settings. By the definitions of and we have 
||/3 — /3*|| 2 < Kf 2 \\X(3 — X/3*\\ 2 /n and ||/3-/3*||f < Kf 2 s*\\Xp - Xfi*\\ 2 /n, which implies 
Corollary 5.4 from Corollary 5.5. □ 

Proof of Corollary 5.8. For any £, recall that f(fi, fj) = Q%j = Q z (i)z(j) ■ Then, (8) of Theorem 

4.1 implies that 

E (/&>&■) - n&Zj)) 2 < a+ s 2 ) e (q*^u) - r(M)) 2 + M ((*o 2 +mo g k*) 

i,j i,j 

under the posterior distribution for any k* G [n], any z* € [k*] n and any Q* G ) 2 . Lemma 

2.1 of [23] implies there exist some z* G [ k*] n and some Q* G R( fc *) 2 such that 

E (<3;-(.vo) - /•(&•&)) 2 S o 3 lV (1)“ A1 , 

i,i v 7 
for any $f^* \in \mathcal{F}_\alpha(L)$ and some absolute constant $C_3 > 0$. Therefore,


niutj)) 2 < M ' 

1 

The proof is complete by choosing k* = [n aA1 + 1 ]. □ 

Proof of Corollary 5.9. The case q = 0 is Corollary 5.5. We consider q £ (0,1]. For the 
effective sparsity defined in Section 5.8, ( 8 ) of Theorem 4.1 implies that 

\\Xp-X/3 *\\ 2 < (l + 5 2 )\\Xp 0 -Xp*\\ 2 + Ms* log^ 

under the posterior distribution for all /3q € £>o(s*). By the “Maurey argument” (see Lemma 
7.2 in [51]), 

-\\XPo - Xj3*\\ 2 < L 2 k 2 /q (s*) 1 ~ 2/q , 
for all /3* € B q (k). Therefore, 

-\\XP - XP *\\ 2 < M' (k 2 /q (s*) 1 ~ 2/q + S lQg ^ 
n \ n 

By the definition of the effective sparsity s*, we obtain the desired result. □ 




Proof of Corollary 5.10. First, we note that by slightly modifying the proof of Theorem 4.1, 
we can have a more general version of ( 8 ), which is 


Efl 


0*f > {l + 5 2 )\\^z*{Q*)-0 *\\ 2 + Me{Z r ,) + t\Y^j 


< exp {-C" (e(Z r *) + \\3£ Z *(Q*) - 0 *\\ 2 + t)) , 


(48) 


for all t > 0. For every j < log 2 n, the model induced by the prior can be represented in 
the general framework by letting Zj = Sj, Tj = Sj, Tj = [2 J ], Z Sj = {Sj C [2 J ] : \Sj\ = Sj }, 
l(Z Sj ) = Sj and Qj = \fn0js y Then, we have the representation iXzji.Qj) = 

The complexity function is 6j(Z Sj ) = 2sjlog^-, which satisfies (4) and (6). By (48) and 

1 

letting t = n 2a +! /log 2 n, we have 


En n\\e jif -eu \ 2 > (i + <$ 2 )n|| 0 _ 


3* 


< exp —C 1 


,n 2a + 1 \ 
log 2 n ) ’ 


0 *J 2 + 2 Ms* log % + 


1 

n 2ck + 1 


log 2 n 



for any 0j * € M 2J with sparsity s*. Since 9* € 0“ ? (L) implies ||0^|| p < L2 aj . we have 


||0j* — 0*j*\\ 2 < C*r 2 j^{L2 


n~ 


- 1 / 2 ) 
for some absolute constant $C^* > 0$ by the proof of Theorem 11.7 in [30], where $r_{2^j,p}(L2^{-aj}, n^{-1/2})$
is the control function defined in Section 11.5 of [30]. Therefore,


En(G^,) < exp ( -C" 


n 2a + 1 

log 2 n 


for all j < log 2 n, where 

Gj = \ \\d jt - 9*£ < M'r 2j p {L2~ aj ,rT 1/2 ) + 


2a 


n 2q: + 1 
log 2 n 


Moreover, $\Pi(\theta_{j\cdot} = 0 \mid Y_{j\cdot}) = 1$ for all $j > \log_2 n$ by the definition of the prior. Using the
independence structure of the posterior distribution, we have

En ((n J - <loga nG i ) c |y) < E En ( G il y ) = E En ( G ^*) 

j <log 2 n l<log 2 n 


< (log 2 n) exp -C" 


n 2a + 1 
log 2 n 


/ ^,n 2 “+i 

< exp — C - 

\ logn 


Finally, the event nj<i og2 n Gj and 9j * = 0 for all j > log 2 n implies 


\\e-e*t < E ll^-^ll 2 + E II 0 ; 


* 112 
j* II 


j<log 2 n 


j>log 2 n 


2a 

n 2 “+ 1 


< w E (wL 2 -y„-v 2) + iL_x )+ V || 9; 


j<log 2 n ' 
// _ 2q: 

< M n 2 “+i, 


* 112 
j* II 


J>log 2 n 


where the last inequality follows the proof of Theorem 12.1 in [30] under the assumption 
a > 1. Hence, the proof is complete. □ 

Proof of Corollary 5.11. Let us write the model induced by the prior distribution in the 
general framework by letting Z = [k], r = k, T = [n], Z k = {[A;]}, £(Z k ) = k and Q = y/nO^y 
Then, we have the representation SZziff) = V™{Qfk]i ^fk] c ^ T ‘ The complexity function f\Z k ) 
is 2k, which satisfies (4) and (6). Then, (8) of Theorem 4.1 implies that 


En (n||6» - r || 2 > (1 + 5 2 )n\\9 - d*\\ 2 + 2Mk* 


y) <exp(-c"(r + ||£-r|| 2 )) 


for any 9 satisfying 9j = 0 for j > k*. Since 9* € S a (L), there exists some 9 satisfying 9j = 0 

for j > k* such that || 9 — 9*|| 2 < L 2 {k*)~ 2a . Therefore, || 9 — 9* || 2 < M' (( k*)~ 2a + ^-) under 

1 

the posterior distribution. Letting k* = [n 2a + 1 ], the proof is complete. □ 
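The last step of this proof is a bias-variance balance: with truncation level $k^* = \lceil n^{1/(2\alpha+1)}\rceil$, the bound $(k^*)^{-2\alpha} + k^*/n$ is of order $n^{-2\alpha/(2\alpha+1)}$. The sketch below makes that arithmetic concrete. The constants $L^2$ and $M'$ are dropped, and the function names are hypothetical; it illustrates the rate calculation only.

```python
import math


def truncation_level(n, alpha):
    """k* = ceil(n^(1 / (2*alpha + 1))), the level that balances the two terms."""
    return math.ceil(n ** (1.0 / (2.0 * alpha + 1.0)))


def risk_bound(n, alpha, k):
    """Approximation error k^(-2*alpha) plus estimation error k / n (constants dropped)."""
    return k ** (-2.0 * alpha) + k / n


def target_rate(n, alpha):
    """The rate n^(-2*alpha / (2*alpha + 1))."""
    return n ** (-2.0 * alpha / (2.0 * alpha + 1.0))


if __name__ == "__main__":
    for n in (10**2, 10**4, 10**6):
        k = truncation_level(n, 1.0)
        print(n, k, risk_bound(n, 1.0, k), target_rate(n, 1.0))
```

Since $k^* \ge n^{1/(2\alpha+1)}$ and $k^* \le n^{1/(2\alpha+1)} + 1$, the bound at $k^*$ is at most three times the target rate, which is what the check below asserts.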


Proof of Corollary 5.12. Note that the model induced by the prior distribution can be written 
in a general way by letting Z = S, t = s, T = [r], Z s = {S C \p\ : \S\ = s} if s < r and 
Z r = {[r]}, £(Z S ) = s and Q = /3s- Then, we have the representation &z{Q) = X*sPs = X/3. 
The complexity function we choose is $\epsilon(\mathcal{Z}_s) = 2s\log\frac{ep}{s}$ for $s < r$ and $\epsilon(\mathcal{Z}_r) = 2r$. It is easy
to check that $\epsilon(\mathcal{Z}_s)$ satisfies (4) and (6). Using (8) of Theorem 4.1, we have


< 


ran (j| fp - nil > (1 + S 2 )\\fp- - r\\l + 2 M s * log[e n P,s 

exp (-G" (n\\f/3* ~ f*\\n + s * lo s|?)) > 



for any /3* with sparsity s*. For this j3*, there exists some /?i such that supp(/?i) C [r] and 
f/3* = // 3 i ■ Therefore, (8) of Theorem 4.1 implies 


En (\\f p - f*\\l > (1 + <y 2 )||/ft 
< exp (-C" {n\\ffr - f*\\ 2 n + r)) . 


|2 

In 


r 

+ 2M- 
n 



Combining the two results by union bound, the proof is complete. □ 
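One reason a complexity of the sparse form $2s\log(ep/s)$ is natural is the standard counting bound $\log\binom{p}{s} \le s\log(ep/s)$: the number of supports of size $s$ is at most $e^{\epsilon/2}$. The sketch below checks this bound numerically. It assumes the complexity has exactly that form, and it does not attempt to verify conditions (4) and (6) themselves, whose statements are not reproduced in this section.

```python
import math


def log_num_models(p, s):
    """log of the number of subsets of [p] with exactly s elements."""
    return math.log(math.comb(p, s))


def complexity(p, s):
    """Assumed complexity function 2 * s * log(e * p / s) for 1 <= s <= p."""
    return 2.0 * s * math.log(math.e * p / s)


if __name__ == "__main__":
    p = 1000
    for s in (1, 10, 100):
        print(s, log_num_models(p, s), 0.5 * complexity(p, s))
```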

Proof of Corollary 5.13. Using the corresponding arguments in [46, 51], Corollary 5.13 is 
implied by Corollary 5.12. □ 


9 Proofs of technical results 

Proof of Lemma 7.1. Consider Qz defined in (30). Then, we have the bound 

\(W,&z(Q)-3r z *m)\ 

< \\&z{Q) - &z{Qz)\\ ( W, 


+11 %z{Qz)-%z*{Q* 


&z(Q) ~ %z{Qz) 

II %z{Q) - %z{Qz) 
<rz(Qz)-<rz*(Q 


II %z{Qz) - %z*{Q* 


< max 


W, 


%z{Q) - % z (Qz) 
\\&z{Q) - &z{Qz)\ 


W, 


^z(Qz) - 

\\&z(Qz) - &z-(Q*)\ 


xV2y/pWQ) ~ &z{QzW + Wz{Qz) ~ &z*(Q*W 


= V2 


max 


3Pz(Q)~^z(Qz) \ 
, \\^z(Q)-^z(Qz)\\/ 


W, 


^z(Qz) - &z*(Q*) 
\\&z(Qz)-#z-(Q*)\ 


where the last equality is due to (31). By (5), 


TV (Q*) 

’ \\&z(Qz)-&z*(Q*)\\ 


< 


ii &z(q) - &z*(or 

-^yJe*(Z T ) with 


probability at least 1 - exp(- pe*(Z T )/A). Now it is sufficient to bound

\(W,& Z (Q))\- 


sup 

QeR e ( z -r) 


&z(Q)-&z(Qz) \ 
’\\&z(Q)-&z{Qz)\\/ 


sup 

,\\& z {Q)\\<i 


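A standard discretization argument as in Lemma A.1 of [23] gives

$$\sup_{Q \in \mathbb{R}^{\ell(\mathcal{Z}_\tau)}:\ \|\mathscr{X}_Z(Q)\| \le 1} |\langle W, \mathscr{X}_Z(Q)\rangle| \le 2\max_{1 \le l \le L} |\langle W, \mathscr{X}_Z(Q_l)\rangle|,$$

where $\{Q_l\}_{1 \le l \le L}$ is a subset of $\{Q \in \mathbb{R}^{\ell(\mathcal{Z}_\tau)} : \|\mathscr{X}_Z(Q)\| \le 1\}$ such that for any $Q \in \mathbb{R}^{\ell(\mathcal{Z}_\tau)}$
with $\|\mathscr{X}_Z(Q)\| \le 1$, there exists an $l \in [L]$ that satisfies $\|\mathscr{X}_Z(Q - Q_l)\| \le 1/2$, and a covering
number argument gives the bound $L \le \exp(5\ell(\mathcal{Z}_\tau))$. Using a union bound together with (5),
we have $\max_{1 \le l \le L} |\langle W, \mathscr{X}_Z(Q_l)\rangle| \le \frac{1}{4}\sqrt{\epsilon^*(\mathcal{Z}_\tau)}$ with probability at least

$$1 - \exp\left(5\ell(\mathcal{Z}_\tau) - \rho\,\epsilon^*(\mathcal{Z}_\tau)/16\right) \ge 1 - \exp\left(-(\rho C_1/16 - 5)\epsilon(\mathcal{Z}_\tau) - \rho C_2\|\mathscr{X}_{Z^*}(Q^*) - \theta^*\|^2/16\right),$$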


where we have used the condition (4). Using union bound again, we have 


\/2max 


/ %z{Q)-%z{Qz) \ 

\ , mQ)-^z(Qz)\\/ 



&z(Qz)-&z*{Q*) \ \ 
\\2£z{Qz)-&z*m\\/ I 


<<We*(Z r ), 


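with probability at least $1 - 2\exp\left(-(\rho C_1/16 - 5)\epsilon(\mathcal{Z}_\tau) - \rho C_2\|\mathscr{X}_{Z^*}(Q^*) - \theta^*\|^2/16\right)$. This
leads to the bound $\mathbb{P}(E_Z^c) \le 2\exp\left(-(\rho C_1/16 - 5)\epsilon(\mathcal{Z}_\tau) - \rho C_2\|\mathscr{X}_{Z^*}(Q^*) - \theta^*\|^2/16\right)$. A similar
argument also leads to the bound $\mathbb{P}(F_Z^c) \le 2\exp\left(5\ell(\mathcal{Z}_\tau) - \rho C_1\epsilon(\mathcal{Z}_{\tau^*})/16 - \rho C_2\|\mathscr{X}_{Z^*}(Q^*) - \theta^*\|^2/16\right)$.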

□ 
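The discretization step in this proof replaces a supremum over the unit ball by twice a maximum over a $1/2$-net. The sketch below illustrates the mechanism in the plane: it builds a $1/2$-net of the unit disk greedily from a fine grid and checks, for random directions $w$, that $\sup_{\|x\| \le 1}\langle w, x\rangle = \|w\|$ is at most twice the maximum over the net. The grid spacing and separation values are illustrative choices, not taken from the paper, and the construction is one simple way to obtain a net rather than the one used in [23].

```python
import math
import random


def build_half_net(h=0.05, sep=0.4):
    """Greedy sep-separated subset of the grid (h * Z)^2 intersected with the unit disk.

    Every grid point ends up within distance < sep of a selected point, and every
    point of the disk is within 0.04 + h / sqrt(2) of a grid point inside the disk,
    so with h = 0.05 and sep = 0.4 the result is a 1/2-net of the unit disk.
    """
    m = int(round(1.0 / h))
    grid = [(i * h, j * h)
            for i in range(-m, m + 1)
            for j in range(-m, m + 1)
            if i * i + j * j <= m * m]
    net = []
    for g in grid:
        if all(math.dist(g, q) >= sep for q in net):
            net.append(g)
    return net


def sup_over_disk(w):
    """sup of <w, x> over the unit disk, which is the Euclidean norm of w."""
    return math.hypot(w[0], w[1])


def max_over_net(w, net):
    """max of <w, x> over the finite net."""
    return max(w[0] * x + w[1] * y for x, y in net)


if __name__ == "__main__":
    net = build_half_net()
    w = (0.3, -1.2)
    print(len(net), sup_over_disk(w), 2 * max_over_net(w, net))
```

The factor 2 comes from $\langle w, x_l\rangle \ge \|w\| - \|w\|\,\|x^* - x_l\| \ge \|w\|/2$ for the net point $x_l$ closest to the maximizer $x^*$.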


Proof of Lemma 7.2. The first inequality holds because 

£ exp (Pe(Z T )) < £ £ eP^ + e? 

{t eT-.e(Tr)<a} 



*= 1 { 

reT :(-l<e(2 T )<t} 


M 


< 

£ te pt + e 13 


t =l 


< 

2M 

e/? 

e/ 3 - 1 

< 

4 m 

exp(/3|a]), 


by $\beta > 2$. The second inequality holds because


£ exp (~/3e(Z r )) < £ £ e~^ 

{rGT:e(Z T )>a} t=Y a \ {r&T:t<e{Z T )<t+ 1 } 


< £ (t + 1)< 


-P t 


*=H 


< 


2 £ ex p (- f/ 3 - 


t=faj 

< 4aexp(/3|_aJ), 


log L«J 
L a J 


(49) 


for /3 > 2 and a > 1. The inequality (49) is because logi < lo f t for all t > |_aj. Finally, 

OO 

£ exp (—/3e(Z T )) < 1 + £t e -^” 1 ) 

{r ST:e(T r )<a} 


t= 1 


< 


i + ^E' 


-09-1 )t 


t= 1 


< 6, 
for $\beta > 2$.


□ 


10 Proofs in Section 6 


Proof of Theorem 6.1. The assumption ts* <1/9 and the argument in the proof of Theorem 
1 of [34] implies 

| S |<7S),. 11 II.P < |s| ll(n- 1 Xj s X, s )- 1 || ( , < 4 (50) 

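for $\delta < 1/4$. Define $\hat\beta_S = \arg\min_b \|Y - X_{*S}b\|^2$. Then it is easy to see that $\|Y - X_{*S}\beta_S\|^2 = \|Y - X_{*S}\hat\beta_S\|^2 + \|X_{*S}(\beta_S - \hat\beta_S)\|^2$. Define the distribution $\mathcal{L}(\hat\beta_S, X_{*S}, \lambda)$ of $\beta_S$ that has
density function

$$\frac{\exp\left(-\frac{1}{2}\|X_{*S}\beta_S - X_{*S}\hat\beta_S\|^2 - \lambda\|X_{*S}\beta_S\|\right)}{\int \exp\left(-\frac{1}{2}\|X_{*S}\beta_S - X_{*S}\hat\beta_S\|^2 - \lambda\|X_{*S}\beta_S\|\right) d\beta_S}. \tag{51}$$

Then, according to the formula of the posterior distribution, sampling $\beta$ from the posterior
distribution is equivalent to first sampling $S$ from $\Pi(S|Y)$ and then sampling $\beta_S \sim \mathcal{L}(\hat\beta_S, X_{*S}, \lambda)$
to form $\beta^T = (\beta_S^T, 0_{S^c}^T)$. Hence, the posterior distribution can be represented as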

£n(S|Y)n s (-|Y) = J2u(S)£0s,X*s, X)®S S c, 

s s 

where n(£|Y) = oj(S) and ns(-|Y) = C(Ps, X^g, A) ®ds° with 

*(|S|) / A V s ' 


uj(S) oc 


\Z\ 


|5|l 


n) * 


X*sPs, A 


e -$\\Y-X.sM‘ 


I{|%||>0}. 


The number $\mathcal{N}_{y,\lambda}$ for any vector $y$ and any scalar $\lambda$ is defined as

$$\mathcal{N}_{y,\lambda} = \int \exp\left(-\frac{1}{2}\|t - y\|^2 - \lambda\|t\|\right) dt.$$
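The normalizing constant $\mathcal{N}_{y,\lambda}$ enters the proof only through ratios, which are controlled by a change of variable and the triangle inequality: $\mathcal{N}_{y,\lambda}/\mathcal{N}_{y',\lambda} \le e^{\lambda\|y - y'\|}$. The sketch below checks this in one dimension by numerical quadrature. It is a scalar illustration with hypothetical function names, not the multivariate computation used in the proof.

```python
import math


def normalizer_1d(y, lam, half_width=30.0, steps=120_000):
    """Trapezoidal approximation of the integral of exp(-(t - y)^2 / 2 - lam * |t|) over the real line."""
    h = 2.0 * half_width / steps
    total = 0.0
    for i in range(steps + 1):
        t = -half_width + i * h
        weight = 0.5 if i in (0, steps) else 1.0
        total += weight * math.exp(-0.5 * (t - y) ** 2 - lam * abs(t))
    return total * h


def ratio_bound_holds(y, y_prime, lam):
    """Check N_{y,lam} / N_{y',lam} <= exp(lam * |y - y'|), up to a tiny numerical slack."""
    ratio = normalizer_1d(y, lam) / normalizer_1d(y_prime, lam)
    return ratio <= math.exp(lam * abs(y - y_prime)) * (1.0 + 1e-6)


if __name__ == "__main__":
    print(normalizer_1d(0.0, 1.0), ratio_bound_holds(3.0, 0.0, 1.0))
```

The bound follows by substituting $u = t - y$ and using $|u + y| \ge |u + y'| - |y - y'|$ inside the exponent.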


Define the event 


E = < max 
ie[p] 


x Jw - -' 

3 < Ciy/logp 


\/n 


(52) 

(53) 

(54) 


for some constant C\ > 0 to be determined later. We have 
En ( \\p-p*\\ 00 >M ] 


log p 


n 


Y 


E Y co(S)U 3 ( || Ps 

|S|<(l+(5)s* V 


V 


> M 


logp 


n 


Y +En(|5| > (1 + 5)s*|Y) 


< E Y, u(S)Us(\\Ps-Ps\\oo>^mJ^ 

|S|<(l+5)s* V U 

II/3ScIIoc<c 2 ^1ST 

+P (E c ) + En(|S| > (1 + (5)s*| Y) 


Y\I e + E Y ^(S)i e 

J |S|<(l+<5)< 

||/3Scl|oc>c 2 /^ 

(55) 


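for some constant $C_2 > 0$ to be determined later. The inequality (55) is due to the inequality
$\|\beta_S - \beta^*_S\|_\infty \le \|\beta_S - \hat\beta_S\|_\infty + \|\hat\beta_S - \beta^*_S\|_\infty$ and

$$E \subset \left\{\|\hat\beta_S - \beta^*_S\|_\infty \le \frac{M}{2}\sqrt{\frac{\log p}{n}}\right\} \tag{56}$$

for all $S$ that satisfies $|S| \le (1 + \delta)s^*$ and $\|\beta^*_{S^c}\|_\infty \le C_2\sqrt{\frac{\log p}{n}}$. Let us give a proof for (56).

By the definition of $\hat\beta_S$, we have $X_{*S}^T X_{*S}\hat\beta_S = X_{*S}^T Y = X_{*S}^T X_{*S}\beta^*_S + X_{*S}^T X_{*S^c}\beta^*_{S^c} + X_{*S}^T W$,
which implies

$$\|\hat\beta_S - \beta^*_S\|_\infty \le 4\|X_{*S}^T X_{*S}(\hat\beta_S - \beta^*_S)\|_\infty / n \le \frac{4}{n}\|X_{*S}^T X_{*S^c}\beta^*_{S^c}\|_\infty + \frac{4}{n}\|X_{*S}^T W\|_\infty.$$

Note that $\frac{4}{n}\|X_{*S}^T X_{*S^c}\beta^*_{S^c}\|_\infty = \frac{4}{n}\|X_{*S}^T X_{*(S^* \cap S^c)}\beta^*_{S^* \cap S^c}\|_\infty \le 8s^*\tau\|\beta^*_{S^c}\|_\infty \le C_2\sqrt{\frac{\log p}{n}}$ due to
$\|\beta^*_{S^c}\|_\infty \le C_2\sqrt{\frac{\log p}{n}}$. We also have $\frac{4}{n}\|X_{*S}^T W\|_\infty \le \frac{4}{n}\max_{j \in [p]} |X_{*j}^T W| \le 4C_1\sqrt{\frac{\log p}{n}}$ on the event $E$. Therefore,
(56) is proved for some $M/2 > 4C_1 + C_2$.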

In view of (55), it is sufficient to bound the four terms in (55). The last term is bounded 


as a result of (9). The third term is bounded by p V 2 / using (5) and a union bound 

argument. Let us give a bound for the first term. 


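\begin{align}
\Pi_S\left(\|\beta_S - \hat\beta_S\|_\infty > \frac{M}{2}\sqrt{\frac{\log p}{n}}\ \Big|\ Y\right)
&\le \sum_{j \in S} \Pi_S\left(|\beta_j - \hat\beta_j| > \frac{M}{2}\sqrt{\frac{\log p}{n}}\ \Big|\ Y\right) \notag\\
&\le \sum_{j \in S} \exp\left(-\frac{t}{2}M\sqrt{\log p}\right) \mathbb{E}_{\Pi_S}\left(e^{\sqrt{n}t|\beta_j - \hat\beta_j|}\ \Big|\ Y\right), \tag{57}
\end{align}

where $\mathbb{E}_{\Pi_S}(\cdot|Y)$ is the posterior expectation with the distribution $\Pi_S(\cdot|Y) = \mathcal{L}(\hat\beta_S, X_{*S}, \lambda)$
and $t > 0$ is some number to be specified later. Using the formula of the density (51), for
any unit vector $v \in \mathbb{R}^{|S|}$, we have

\begin{align}
\mathbb{E}_{\Pi_S}\left(e^{\sqrt{n}tv^T(\beta_S - \hat\beta_S)}\ \Big|\ Y\right)
&= \frac{\int \exp\left(\sqrt{n}tv^T(\beta_S - \hat\beta_S) - \frac{1}{2}\|X_{*S}\beta_S - X_{*S}\hat\beta_S\|^2 - \lambda\|X_{*S}\beta_S\|\right) d\beta_S}{\int \exp\left(-\frac{1}{2}\|X_{*S}\beta_S - X_{*S}\hat\beta_S\|^2 - \lambda\|X_{*S}\beta_S\|\right) d\beta_S} \notag\\
&= e^{\frac{1}{2}t^2\|(n^{-1}X_{*S}^T X_{*S})^{-1/2}v\|^2}\, \frac{\int \exp\left(-\frac{1}{2}\|X_{*S}(\beta_S - \hat\beta_S - tn^{-1/2}(n^{-1}X_{*S}^T X_{*S})^{-1}v)\|^2 - \lambda\|X_{*S}\beta_S\|\right) d\beta_S}{\int \exp\left(-\frac{1}{2}\|X_{*S}(\beta_S - \hat\beta_S)\|^2 - \lambda\|X_{*S}\beta_S\|\right) d\beta_S} \notag\\
&\le \exp\left(\frac{1}{2}t^2\|(n^{-1}X_{*S}^T X_{*S})^{-1/2}v\|^2 + \lambda t\|(n^{-1}X_{*S}^T X_{*S})^{-1/2}v\|\right) \tag{58}\\
&\le \exp\left(\frac{1}{2}\lambda^2 + t^2\|(n^{-1}X_{*S}^T X_{*S})^{-1/2}v\|^2\right) \notag\\
&\le \exp\left(\frac{1}{2}\lambda^2 + 16t^2\right), \tag{59}
\end{align}

where the inequality (58) is due to a change of variable and the triangle inequality and the
inequality (59) is by (50). Specializing $v$ so that $v^T(\beta_S - \hat\beta_S) = \pm(\beta_j - \hat\beta_j)$, we have

$$\mathbb{E}_{\Pi_S}\left(e^{\sqrt{n}t|\beta_j - \hat\beta_j|}\ \Big|\ Y\right) \le \mathbb{E}_{\Pi_S}\left(e^{\sqrt{n}t(\beta_j - \hat\beta_j)}\ \Big|\ Y\right) + \mathbb{E}_{\Pi_S}\left(e^{-\sqrt{n}t(\beta_j - \hat\beta_j)}\ \Big|\ Y\right) \le 2e^{\frac{1}{2}\lambda^2 + 16t^2}.$$

Letting $t = \sqrt{\log p}$, we have

$$\Pi_S\left(\|\beta_S - \hat\beta_S\|_\infty > \frac{M}{2}\sqrt{\frac{\log p}{n}}\ \Big|\ Y\right) \le 2e^{\lambda^2/2}p^{-(M/2 - 17)},$$

which bounds the first term of (55).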
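The moment generating function bound (58) can be checked by hand in one dimension. Suppose $X_{*S}$ is a single column $x$ with $n^{-1}x^Tx = g$, so that the density (51) becomes proportional to $\exp(-\tfrac12 ng(\beta - \hat\beta)^2 - \lambda\sqrt{ng}|\beta|)$ and the bound reads $\mathbb{E}\,e^{\sqrt{n}t(\beta - \hat\beta)} \le \exp(\tfrac12 t^2/g + \lambda t/\sqrt{g})$. The sketch below evaluates the left side by quadrature and compares. The scalar setting and the names `g` and `bhat` are assumptions made for illustration.

```python
import math


def posterior_mgf_1d(n, g, bhat, lam, t, half_width=6.0, steps=120_000):
    """E exp(sqrt(n) * t * (beta - bhat)) under the one-dimensional version of (51).

    The density of beta is proportional to
    exp(-n * g * (beta - bhat)^2 / 2 - lam * sqrt(n * g) * |beta|).
    Integration is over u = beta - bhat with a trapezoidal rule.
    """
    h = 2.0 * half_width / steps
    s = math.sqrt(n) * t
    scale = math.sqrt(n * g)
    num = 0.0
    den = 0.0
    for i in range(steps + 1):
        u = -half_width + i * h
        weight = 0.5 if i in (0, steps) else 1.0
        log_density = -0.5 * n * g * u * u - lam * scale * abs(u + bhat)
        num += weight * math.exp(s * u + log_density)
        den += weight * math.exp(log_density)
    return num / den


def mgf_bound(g, lam, t):
    """Right side of (58) in one dimension, where ||(n^{-1} x^T x)^{-1/2} v|| = g^{-1/2}."""
    a = 1.0 / math.sqrt(g)
    return math.exp(0.5 * t * t * a * a + lam * t * a)


if __name__ == "__main__":
    print(posterior_mgf_1d(50, 1.0, 0.3, 1.0, 1.0), mgf_bound(1.0, 1.0, 1.0))
```

Completing the square produces the factor $e^{t^2/(2g)}$, and shifting the integration variable costs at most $e^{\lambda t/\sqrt{g}}$ through the triangle inequality, which mirrors the two ingredients named after (59).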

Now, let us give a bound for the second term of (55). Given j = argmax^u | A/1 , for any 
S C [p] such that j ^ S , define S' = 5u{j}. We are going to provide a bound for a j(S)/uj(S') 
on the event E to argue the model S' is favored over the model S under the posterior 
distribution if |A/1 is large. Because of (50), \Z\g\\ = |-2)s|| = (|s|) f° r a ll \S\ < (1 + S)s*. By 

(52) , we have 

= n (\ S \) (|S'|) VI • A/ X tS S g ,A i||y_x t ,q/§,<;ll 2 + L|y-W g /SQ/ll 2 

(|§|) A ^ s , 0 s „x 

Since /ttwt < exp (2D log(ep)), Afiy < p, a/ . a * s/§s ’ a < e A H- x * B '^ s- - x *s , As'll by the definition 

(53) and a change of variable, and —\\\Y — X*sPs\\ 2 + \\\Y — X^s'Ps’W 2 = -gll^sAsll 2 — 

^||X*s'As'|| 2 ) we have 


< y^L( ep YD+i e \\\x tS p s -x tS ,p sl \\+l\\x fS Ps\\ 2 -h\\x* S 'P S '\\ 2 
w(S') A 


(60) 
Let $P_S$ and $P_{S'}$ stand for the projection matrices onto the column spaces of $X_{*S}$ and $X_{*S'}$,
respectively. Then $X_{*S}\hat\beta_S = P_S Y$ and $X_{*S'}\hat\beta_{S'} = P_{S'}Y$. Let $F$ be the orthogonal complement
of the column space of $X_{*S}$ in the column space of $X_{*S'}$, and then define $P_F$ to be the
associated projection matrix. It is easy to see that $P_{S'} = P_S + P_F$ and $P_S P_F = 0$. Thus,
the exponent of (60) equals $\lambda\|P_F Y\| - \frac{1}{2}\|P_F Y\|^2 \le -\frac{1}{4}\|P_F Y\|^2 + \lambda^2 \le -\frac{1}{8}\|P_F X\beta^*\|^2 +
\frac{1}{4}\|P_F W\|^2 + \lambda^2$. We are going to give a lower bound on $\|P_F X\beta^*\|^2$ and an upper bound on
$\|P_F W\|^2$. To facilitate the proof, we bound $\|P_S X_{*j}\|^2$ as


IIPsX^-ii 2 = xTjX'sixTsX^-'xJsX'j 

< Mn^XJsX.jW 2 

s a * 2 s n 

< 8 ns t < — 

by (50) and ts* <1/9. The noise part ||PfVL|| 2 is bounded as 


II^WH 2 = 


< 

< 

< 


{I — P S )X* j Xj' j (I — P s ) 
\\(I-Ps)X ^-ll 2 
\XP(I - P S )W\ 2 
\\(i — P s )x*j \\ 2 
2\XPW\ 2 + 2\XPP S W\ 2 
9n/10 

8C 2 log p, 


(61) 


(62) 

(63) 


where (62) is because of (61) and (63) is derived from the event E and the following argument 
that 


\xPp s w \ 2 = \xTx*s{xT s x*s)- l xT s w \ 2 

< 16n\\X^X,s/n\\ 2 \\Xj s W/V^\\ 2 

< 32Cf (s*r) 2 n logp < ^C 2 nlogp 

by (50) and the event E. The signal part ||-PfW/ 3*|| 2 is lower bounded by 


\\p F xp*\\>m-Ps)x*j\\\p* 


y. 

i£S*n(Su{j}) c 


XT{I - P S )X* l 
||(J-P S )X, J -|| 


where the first term on the right hand side above is lower bounded by y / 9n/10|/3*| by (61), 
and the second term is upper bounded by 


E 

ies*n{j} c 


l« 


*l\ 


II (I-Ps)X, 


*3 I 


+ E 

leS*nS c 


IX^PsX^ 


II (I-Ps)X, 


< 7^/n\(5*\/9 


*3 I 


due to (61), (50), rs* < 1/9 and the fact |/3*| = max i6 [ p ] |/3*|. Therefore, ||.PfX/3*|| > 
V^\Pj\/7- When |/3*| > we have -|||PfA:/3*|| 2 + \\\P F W\\ 2 < -2C 2 logp. 
Plugging this bound into (60), we have $\frac{\omega(S)}{\omega(S')} \le \frac{\sqrt{\pi}}{\lambda}e^{2D + 1 + \lambda^2}p^{-(2C_1^2 - 2D - 1)}$, which implies


£ w(S)n £ = £ B ff fa]) y(suO-})i. < 


|S|<(l+<5)«* 

&S 


|S|<(l+5)s‘ 

&s 


By letting $C_2 = 400C_1$, a mathematical induction argument in [15] leads to a bound on the
second term of (55) that

$$\sum_{\substack{|S| \le (1+\delta)s^* \\ \|\beta^*_{S^c}\|_\infty > C_2\sqrt{\log p / n}}} \omega(S) I_E \le p^{-C_3},$$

for some constant $C_3$ depending on $C_1, D, \lambda$. Moreover, $C_3$ is increasing with $C_1$.
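The bound on $\omega(S)/\omega(S')$ above uses the decomposition $P_{S'} = P_S + P_F$ together with the scalar inequality $\lambda a - \tfrac12 a^2 \le -\tfrac14 a^2 + \lambda^2$. The sketch below checks both in a small example where $X_{*S}$ is a single column $u$ and the added column is $v$. The vectors are made-up illustrative data, and the projection onto the pair is computed independently through the $2 \times 2$ normal equations so that the comparison is not circular.

```python
def dot(u, v):
    """Euclidean inner product of two equal-length lists."""
    return sum(x * y for x, y in zip(u, v))


def proj_onto_vector(u, y):
    """Orthogonal projection of y onto the line spanned by u."""
    c = dot(u, y) / dot(u, u)
    return [c * ui for ui in u]


def proj_onto_pair(u, v, y):
    """Orthogonal projection of y onto span{u, v}, via the 2x2 normal equations."""
    a, b, c = dot(u, u), dot(u, v), dot(v, v)
    ry, sy = dot(u, y), dot(v, y)
    det = a * c - b * b
    alpha = (c * ry - b * sy) / det
    beta = (a * sy - b * ry) / det
    return [alpha * ui + beta * vi for ui, vi in zip(u, v)]


def residual_direction(u, v):
    """(I - P_S) v: the part of v orthogonal to u, which spans F."""
    pv = proj_onto_vector(u, v)
    return [vi - pi for vi, pi in zip(v, pv)]


def exponent_gap(lam, a):
    """(-a^2/4 + lam^2) - (lam*a - a^2/2), which equals (a/2 - lam)^2 >= 0."""
    return (-0.25 * a * a + lam * lam) - (lam * a - 0.5 * a * a)


if __name__ == "__main__":
    u = [1.0, 2.0, 0.0, 1.0]
    v = [0.0, 1.0, 1.0, -1.0]
    y = [3.0, -1.0, 2.0, 0.5]
    f = residual_direction(u, v)
    lhs = proj_onto_pair(u, v, y)
    rhs = [p + q for p, q in zip(proj_onto_vector(u, y), proj_onto_vector(f, y))]
    print(lhs, rhs)
```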

Finally, combining the bounds for the four terms in (55), we get 


En 


-01 00 >M 


log p 


n 


Y 


A2 /2 -(f-17) + -C 3 + +e -CVlog 


□ 


< 2e 

< P" C4 , 

for some M, C 4 depending on p, A, D. 

Proof of Theorem 6.2. For $B_T$, we use $\|\cdot\|$ to denote the $\ell_2$ norm, $\|B_T\| = \sqrt{\sum_{(i,j) \in T} B_{ij}^2}$.
Let us first establish (19) and (20). The proof is close to that of Theorem 4.1. By the
definition of the prior, the posterior distribution has formula
£r«C T)R(T, U ) 


n (B G U\Y) = 


£ T a(T)i?(T) ’ 


(64) 


where R(T,U) is defined by 
A 


e -hUB T ,0 T c)-B*\\ 2 + (W,(B T ,0 T c)-B*)-\\\B T \\ dB 


' {Bt ,0tc)gU 


$R(T) = R(T, \mathbb{R}^{p \times m})$ and

a{T) = exp (^-D (jr(T)| log + l T l lo S ~~j^~^ 

Moreover, for a set of subsets A, the posterior distribution can be written as 

n (T £ A\Y) = ‘TEmi 

' 1 T. t c[T)R(T) ■ 

We need to give a lower bound for $R(T^*)$ with $T^* = S^* \times [m]$ and give upper bounds for
$R(T)$ and $R(T, U)$. For each subset $T$, define the following events


(65) 


E t = { \(W,(B t ,0t°)-B*)\ < 4 Ci m|r(T)| + |r(r)|log 


ep 


|r(T)| 


||(B T ,0 T c) - 5*|| for all B t G 


F t = ||(LF,(5 T ,0 T c) -5*) I < ^Ci [ms* + s*log ||(5 T , 0 T c) - B*\\ for all B t G M |t| j 
for some constant $C_1 > 0$ to be determined later. A special case of Lemma 7.1 gives


p (E?) < 2 e _(pCl/ 16 _ 6 ) V m|r(T)l+|r(T)|log FmiJ and P {F?) < 2 e 5 m l r ( T )l-^( ms * +s * log f£). 

( 66 ) 

The same arguments used for deriving (29), (36) and (42) imply 

R(T*) > e — A||B*||-(l+A+A- 1 )m S * ) ( 6 7) 

R(T)I Et < (2A)l T le 2A2_A " S *^ +2C ' 1 ( m ' r ^ ) l + l r(2 ’I 108 77771 ), (68) 

R(T,U)I Ft < ( 2 \/ 2 A) |T| e 2 A 2 " A||i 3 * ll "(i M " 2 C ' 1 )( ms *+ s * lo gf^), (69) 


with U = | \\B — B* || > M (ms* + s* log f?) }• Let A = {|r(T)| > (1 + d)s*}. By the formula 
(65) and the inequalities (67) and ( 68 ), we have 

En(T € A\Y) 

* E^f^ + En^) 


T&A v v ' T&A 

< e (C 2 +D)(ms*+s*logf£) 


E E 


s -(D-C 2 )(ms+s logf) 


E 


= -D|T|log£ff 


s>(l+<5)s* 5:|S|=s 

q_2 y^ ^2 e-( pCl / 16 - 7 ' ) ( ms+sl ° s ^ 


T :r(T)=S 


s>(l+<5)s* S:|S|=s 
< g—C' (ms* +s* log |^) 

for some sufficiently large D with C 2 only depending on C\ and A and C' only depending on 
D,p, A. By the formula (64) and the inequalities (67) and (42), we have 

En(£ g U\Y) 

< y I + y F(FZ) + e-C'(ms* + s* log fg) 

“ ^ a(T*) R(T*) T y TJ 


TgA c v v ' TeA c 

< e -(\M-C3-D)(m,s*+s*log^r) y^ y^ e~ D<ymS+slog ^ e 

s<(l+«5)s* S:|S|=s T:r(T)=S 

+2e - ^6"( ms * +s * log ^) y^ ^2 e 6ms + e~ c '^ ms * +s * l ° s ^ 
s<(l+(5)s* S:|5|=s 

< g-C“(ms‘+s* log pr) 


~D\T\ logoff 


for some sufficiently large M with C 3 only depending on C\ and A and C" only depending 
on D, p, A. Hence, (19) and (20) are proved. 

Now let us proceed to prove (21). We are going to use an argument similar to that of
Theorem 6.1. Note that the posterior distribution can be represented as

$$\sum_T \Pi(T|Y)\Pi_T(\cdot|Y) = \sum_T \omega(T)\,\mathcal{L}(Y_T, \lambda) \otimes \delta_{T^c},$$

where $\Pi(T|Y) = \omega(T)$ and $\Pi_T(\cdot|Y) = \mathcal{L}(Y_T, \lambda) \otimes \delta_{T^c}$ with

$$\omega(T) \propto \left(\frac{\lambda}{\sqrt{\pi}}\right)^{|T|} \alpha(T)\,\mathcal{N}_{Y_T,\lambda}\, e^{\frac{1}{2}\|Y_T\|^2}.$$

The distribution $B_T \sim \mathcal{L}(Y_T, \lambda)$ is defined through the density function

$$\mathcal{N}_{Y_T,\lambda}^{-1}\, e^{-\frac{1}{2}\|B_T - Y_T\|^2 - \lambda\|B_T\|},$$

where $\mathcal{N}_{Y_T,\lambda}$ is the normalizing constant defined in (53). Define the event


E = < max |Wij| < Ci y/\og(pm) \ 
L(*u)e[p]x[m] J 


for some constant C\ > 0. We have 
Eu(\\B-B*\\ 00 >My/\og(pm) 


Y 


< E Y w(T)n r f\\B t - Yrlloo > i/log (pm) Y^ l E + E ^ w(T)I s 

|r(T)|<(l+(5)s* ' ' |r(T)|<(l+ g)s* 

II Bye II oo >C2 y/\og(pm) 

+P (E c ) + En (|r(T)| > (1 + <5)s*|Y). (70) 


It is sufficient to bound the four terms in (70). The last term is bounded by (19). Using (5) 

and a union bound argument, we bound the third term in (70) as P(E C ) < (pm) A 
Using the same arguments in deriving (57) and (59), we have 


If T ( \\Bt - Yrlloo > —My/\og(pm) 


Y 


< Y exp (-^VlogCpm)) E Ht (e 


y/rit | Bi j ^ ij 


Y 


(ij')er 


< 2e A2 /Ve" ¥ M V l °g(pm)+t 2 < 2e A2/2 (pm)"(f- 2 ) 


by choosing $t = \sqrt{\log(pm)}$. This bounds the first term of (70). Now let us provide a bound
for the second term of (70). Given some $(i,j) \in [p] \times [m]$, for any subset $T$ such that $(i,j) \notin T$,
use the notation $T' = T \cup \{(i,j)\}$. To facilitate the proof, we need an upper bound for
$\omega(T)/\omega(T')$ on the event $E$. Direct calculation gives


w(T) _ s/tt a(T) My t ,x -ly? 
cu(T') “ A cx(T')M Yt ,/ 


Since < (epm ) 3D , and 


A/y T ,A 

M' T ,, A 


< C' A e A l y «l 


(71) 
for some constant $C_\lambda$ only depending on $\lambda$, we have $\omega(T)/\omega(T') \le C_\lambda\frac{\sqrt{\pi}}{\lambda}(epm)^{3D}e^{\lambda|Y_{ij}| - \frac{1}{2}Y_{ij}^2}$.
The inequality (71) will be established at the end of the proof. Since

< a 2 -1 r, 2 

< a 2 - + ht; 2 < a 2 - 1c 2 iog(pm) 

when \B*-\ > 2C\^\og(pm) on the event E. Hence, 


uj[T) A 


(pm) ( 4^1 3D \ 


(72) 


Let C 2 = 2C\ and define {(7 1 ,. 71 ), (i q , j q )} to be the set such that |i?* -J > 2C\^J\og(pm) 
for all l £ [q]. Then, we have 


I Bj*c 11 00 > C 2 \/log(pm)} C U i e [ q ]{(ii,ji) £ T}, 


which implies 

E 

\r(T)\<(l+ S)s* 

||SJc||cx>>C2v / log(pm) 


“(r) < E E 


u(T) 


ie[q] TefT^iijOm 


uj(TL) {(ii,ji)}) 


uj(TU {(ii,ji)}) 


< 


C^e 3D+x2 (pm)-^ c ?-3D) W (TU {(ipji)}) 

le[q] T&{T-.(i l: mT} 

< (pm)~c 

by (72) for some constant $C$ with sufficiently large $C_1$. Combining the bounds for the four
terms in (70), we reach the conclusion (21).

Finally, let us establish (71) to close the proof. By change of variable, we have 

M Yt a = [ \ e-^Mllfell 2 -AV(H+ll^||) 2 +||6 2 || 2 d6ld62 , 

J Rl T l-i J K 

and _ 

My ,,\= / e _ ^C 6 i +6 3) _ 5ll b2 ll 2 - A v / ( fe i+ll y T'll) 2 +ll 6 2|| 2 +fe! dbidb 2 db 3 . 

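Therefore, the triangle inequality implies

$$\mathcal{N}_{Y_{T'},\lambda} \ge \mathcal{N}_{Y_T,\lambda} \int_{\mathbb{R}} e^{-\frac{1}{2}b^2 - \lambda|b|}\, db\ e^{-\lambda\left|\|Y_T\| - \|Y_{T'}\|\right|} \ge C_\lambda^{-1} e^{-\lambda|Y_{ij}|}\,\mathcal{N}_{Y_T,\lambda},$$

where $C_\lambda = \left(\int_{\mathbb{R}} e^{-\frac{1}{2}b^2 - \lambda|b|}\, db\right)^{-1}$. Thus, the proof is complete.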


□ 
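The constant in (71) involves the one-dimensional integral $\int_{\mathbb{R}} e^{-b^2/2 - \lambda|b|}\,db$, which has the closed form $\sqrt{2\pi}\,e^{\lambda^2/2}\,\mathrm{erfc}(\lambda/\sqrt{2})$ (complete the square on each half-line). The sketch below compares this closed form with a numerical quadrature. The closed form is a standard calculation supplied here for illustration and is not stated in the paper.

```python
import math


def laplace_gauss_integral_numeric(lam, half_width=30.0, steps=120_000):
    """Trapezoidal approximation of the integral of exp(-b^2 / 2 - lam * |b|) over the real line."""
    h = 2.0 * half_width / steps
    total = 0.0
    for i in range(steps + 1):
        b = -half_width + i * h
        weight = 0.5 if i in (0, steps) else 1.0
        total += weight * math.exp(-0.5 * b * b - lam * abs(b))
    return total * h


def laplace_gauss_integral_closed(lam):
    """sqrt(2*pi) * exp(lam^2 / 2) * erfc(lam / sqrt(2)), valid for lam >= 0."""
    return math.sqrt(2.0 * math.pi) * math.exp(0.5 * lam * lam) * math.erfc(lam / math.sqrt(2.0))


if __name__ == "__main__":
    for lam in (0.1, 1.0, 3.0):
        print(lam, laplace_gauss_integral_numeric(lam), laplace_gauss_integral_closed(lam))
```

Since the integral is decreasing in $\lambda$, its reciprocal grows with $\lambda$, consistent with the constant depending only on $\lambda$.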



References 

[1] Alekh Agarwal, Animashree Anandkumar, and Praneeth Netrapalli. Exact recovery of
sparsely used overcomplete dictionaries. arXiv preprint arXiv:1309.1952, 2013.

[2] David J Aldous. Representations for partially exchangeable arrays of random variables.
Journal of Multivariate Analysis, 11(4):581-598, 1981.

[3] Sergey Bakin. Adaptive regression and model selection in data mining problems. 1999.

[4] Sayantan Banerjee and Subhashis Ghosal. Posterior convergence rates for estimating
large precision matrices using graphical models. Electronic Journal of Statistics,
8(2):2111-2137, 2014.

[5] Andrew Barron, Mark J Schervish, and Larry Wasserman. The consistency of posterior
distributions in nonparametric problems. The Annals of Statistics, 27(2):536-561, 1999.

[6] Andrew R Barron. The exponential convergence of posterior probabilities with implications
for Bayes estimators of density functions. Univ. of Illinois, 1988.

[7] Peter J Bickel, Ya'acov Ritov, and Alexandre B Tsybakov. Simultaneous analysis of
Lasso and Dantzig selector. The Annals of Statistics, pages 1705-1732, 2009.

[8] Lucien Birgé and Pascal Massart. Gaussian model selection. Journal of the European
Mathematical Society, 3(3):203-268, 2001.

[9] Lawrence D Brown and Mark G Low. Asymptotic equivalence of nonparametric regression
and white noise. The Annals of Statistics, 24(6):2384-2398, 1996.

[10] Peter Bühlmann and Sara A van de Geer. Statistics for high-dimensional data: methods,
theory and applications. Springer Science & Business Media, 2011.

[11] Florentina Bunea. Consistent selection via the Lasso for high dimensional approximating
regression models. In Pushing the limits of contemporary statistics: contributions in
honor of Jayanta K. Ghosh, pages 122-137. Institute of Mathematical Statistics, 2008.

[12] Emmanuel J Candès and Terence Tao. Decoding by linear programming. IEEE Transactions
on Information Theory, 51(12):4203-4215, 2005.

[13] Ismael Castillo. On Bayesian supremum norm contraction rates. The Annals of Statistics,
42(5):2058-2091, 2014.

[14] Ismael Castillo and Aad W van der Vaart. Needles and straw in a haystack: Posterior
concentration for possibly sparse sequences. The Annals of Statistics, 40(4):2069-2101,
2012.

[15] Ismael Castillo, Johannes Schmidt-Hieber, and Aad W van der Vaart. Bayesian linear
regression with sparse priors. The Annals of Statistics (to appear), 2014.




[16] Olivier Catoni. Statistical learning theory and stochastic optimization. Lecture Notes in
Mathematics, 1851, 2004.

[17] Persi Diaconis and Svante Janson. Graph limits and exchangeable random graphs. arXiv
preprint arXiv:0712.2749, 2007.

[18] David L Donoho and Iain M Johnstone. Minimax risk over $\ell_p$-balls for $\ell_q$-error. Probability
Theory and Related Fields, 99(2):277-303, 1994.

[19] David L Donoho, Michael Elad, and Vladimir N Temlyakov. Stable recovery of sparse
overcomplete representations in the presence of noise. IEEE Transactions on Information
Theory, 52(1):6-18, 2006.

[20] Kai-Tai Fang, Samuel Kotz, and Kai Wang Ng. Symmetric multivariate and related
distributions. Chapman and Hall, 1990.

[21] Chao Gao and Harrison H Zhou. Adaptive Bayesian estimation via block prior. arXiv
preprint arXiv:1312.3937, 2013.

[22] Chao Gao and Harrison H Zhou. Rate-optimal posterior contraction for sparse PCA. The
Annals of Statistics, 43(2):785-818, 2015.

[23] Chao Gao, Yu Lu, and Harrison H Zhou. Rate-optimal graphon estimation. arXiv
preprint arXiv:1410.5837, 2014.

[24] Subhashis Ghosal, Jayanta K Ghosh, and RV Ramamoorthi. Posterior consistency of
Dirichlet mixtures in density estimation. The Annals of Statistics, 27(1):143-158, 1999.

[25] Subhashis Ghosal, Jayanta K Ghosh, and Aad W van der Vaart. Convergence rates of
posterior distributions. The Annals of Statistics, 28(2):500-531, 2000.

[26] John A Hartigan. Direct clustering of a data matrix. Journal of the American Statistical
Association, 67(337):123-129, 1972.

[27] Marc Hoffmann, Judith Rousseau, and Johannes Schmidt-Hieber. On adaptive posterior
concentration rates. The Annals of Statistics (to appear), 2013.

[28] Paul W Holland, Kathryn Blackmond Laskey, and Samuel Leinhardt. Stochastic
blockmodels: First steps. Social Networks, 5(2):109-137, 1983.

[29] Douglas N Hoover. Relations on probability spaces and arrays of random variables.
Preprint, Institute for Advanced Study, Princeton, NJ, 2, 1979.

[30] Iain M Johnstone. Gaussian estimation: Sequence and wavelet models. 2011.

[31] Olav Kallenberg. On the representation theorem for exchangeable arrays. Journal of
Multivariate Analysis, 30(1):137-154, 1989.




[32] Bas J K Kleijn and Aad W van der Vaart. Misspecification in infinite-dimensional 
bayesian statistics. The Annals of Statistics, pages 837-877, 2006. 

[33] Gilbert Leung and Andrew R Barron. Information theory and mixing least-squares 
regressions. IEEE Transactions on Information Theory, 52(8):3396-3410, 2006. 

[34] Karim Lounici. Sup-norm convergence rate and sign concentration property of lasso and 
dantzig estimators. Electronic Journal of statistics, 2:90-102, 2008. 

[35] Karim Lounici, Massimiliano Pontil, Sara A van de Geer, and Alexandre B Tsybakov. 
Oracle inequalities and optimal inference under group sparsity. The Annals of Statistics, 
39(4):2164-2204, 2011. 

[36] Laszlo Lovasz. Large networks and graph limits, volume 60. American Mathematical 
Soc., 2012. 

[37] Laszlo Lovasz and Balazs Szegedy. Limits of dense graph sequences. Journal of Combi¬ 
natorial Theory, Series B, 96(6):933-957, 2006. 

[38] Yu Lu and Harrison H Zhou. Minimax rates for product of three matrices. 2015. 

[39] Zongming Ma and Yihong Wu. Volume ratio, sparsity, and minimaxity under unitarily 
invariant norms. arXiv preprint arXiv:1306.3609, 2013. 

[40] Ryan Martin, Raymond Mess, and Stephen G Walker. Empirical bayes posterior con¬ 
centration in sparse high-dimensional linear models. arXiv preprint arXiv:lf06.7718, 
2014. 

[41] Arkadi Nemirovski. Topics in non-parametric statistics. 2000. 

[42] Michael Nussbaum. Asymptotic equivalence of density estimation and gaussian white 
noise. The Annals of Statistics, pages 2399-2430, 1996. 

[43] Bruno A Olshausen. Emergence of simple-cell receptive field properties by learning a 
sparse code for natural images. Nature, 381 (6583):607-609, 1996. 

[44] Garvesh Raskutti, Martin J Wainwright, and Bin Yu. Minimax rates of estimation
for high-dimensional linear regression over $\ell_q$-balls. IEEE Transactions on Information
Theory, 57(10):6976-6994, 2011.

[45] Philippe Rigollet and Alexandre B Tsybakov. Exponential screening and optimal rates 
of sparse estimation. The Annals of Statistics, 39(2):731-771, 2011. 

[46] Philippe Rigollet and Alexandre B Tsybakov. Sparse estimation by exponential
weighting. Statistical Science, 27(4):558-575, 2012.

[47] Vincent Rivoirard and Judith Rousseau. Posterior concentration rates for infinite
dimensional exponential families. Bayesian Analysis, 7(2):311-334, 2012.


[48] Xiaotong Shen and Larry Wasserman. Rates of convergence of posterior distributions. 
The Annals of Statistics, pages 687-714, 2001. 

[49] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal
Statistical Society. Series B, 58(1):267-288, 1996.

[50] Alexandre B Tsybakov. Optimal rates of aggregation. In Learning Theory and Kernel 
Machines, pages 303-313. Springer, 2003. 

[51] Alexandre B Tsybakov. Aggregation and minimax optimality in high-dimensional
estimation. In Proceedings of the International Congress of Mathematicians, 2014.

[52] Sara A van de Geer and Peter Bühlmann. On the conditions used to prove oracle results
for the Lasso. Electronic Journal of Statistics, 3:1360-1392, 2009.

[53] Aad W van der Vaart and J Harry van Zanten. Rates of contraction of posterior
distributions based on Gaussian process priors. The Annals of Statistics, pages 1435-1463,
2008.

[54] Nicolas Verzelen. Minimax risks for sparse regressions: Ultra-high dimensional
phenomenons. Electronic Journal of Statistics, 6:38-90, 2012.

[55] Zhan Wang, Sandra Paterlini, Frank Gao, and Yuhong Yang. Adaptive minimax
estimation over sparse $\ell_q$-hulls. arXiv preprint arXiv:1108.1961, 2011.

[56] Yuhong Yang. Aggregating regression procedures to improve performance. Bernoulli,
10(1):25-47, 2004.

[57] Yun Yang and David B Dunson. Minimax optimal Bayesian aggregation. arXiv preprint
arXiv:1403.1345, 2014.

[58] Ming Yuan and Yi Lin. Model selection and estimation in regression with grouped
variables. Journal of the Royal Statistical Society: Series B, 68(1):49-67, 2006.

[59] Cun-Hui Zhang and Jian Huang. The sparsity and bias of the Lasso selection in
high-dimensional linear regression. The Annals of Statistics, pages 1567-1594, 2008.




